2004-06-04 02:07:55

by Ed Tomlinson

Subject: ide errors in 7-rc1-mm1 and later

Hi,

I am still getting these ide errors with 7-rc2-mm2. I get the errors even
if I mount with barrier=0 (or just defaults). It would seem that something is
sending my drive commands it does not understand...

May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }

How can we find out what is wrong?

This does not seem to be an error that corrupts the fs; it just slows things
down when it hits a group of these. Note that they keep popping up - they
do not stop (I still get them hours after booting).

TIA
Ed Tomlinson
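
For reference, the two register values in the log lines above can be decoded bit by bit. The following is a minimal sketch in user-space C, not kernel code. The bit names follow the strings printed in the log; the helper names (decode_status, decode_error) are made up for illustration, and the 0x80 error bit is omitted for brevity.

```c
#include <stdio.h>

/* One named bit of an ATA register. */
struct bit_name {
    unsigned char mask;
    const char *name;
};

/* Status register bits, named as in the kernel log output. */
static const struct bit_name status_bits[] = {
    { 0x40, "DriveReady" },
    { 0x20, "DeviceFault" },
    { 0x10, "SeekComplete" },
    { 0x08, "DataRequest" },
    { 0x04, "CorrectedError" },
    { 0x02, "Index" },
    { 0x01, "Error" },
};

/* Error register bits (0x80 left out in this sketch). */
static const struct bit_name error_bits[] = {
    { 0x40, "UncorrectableError" },
    { 0x10, "SectorIdNotFound" },
    { 0x04, "DriveStatusError" },   /* ABRT: command aborted */
    { 0x02, "TrackZeroNotFound" },
    { 0x01, "AddrMarkNotFound" },
};

/* Append the name of every set bit to out, separated by spaces. */
static void decode_bits(unsigned char value, const struct bit_name *table,
                        size_t count, char *out, size_t outlen)
{
    size_t used = 0;

    if (outlen == 0)
        return;
    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        int n;

        if (!(value & table[i].mask))
            continue;
        n = snprintf(out + used, outlen - used, "%s%s",
                     used ? " " : "", table[i].name);
        if (n < 0 || (size_t)n >= outlen - used)
            break;              /* output truncated, stop appending */
        used += (size_t)n;
    }
}

void decode_status(unsigned char stat, char *out, size_t outlen)
{
    if (stat & 0x80) {          /* other bits are not meaningful while busy */
        snprintf(out, outlen, "Busy");
        return;
    }
    decode_bits(stat, status_bits,
                sizeof status_bits / sizeof status_bits[0], out, outlen);
}

void decode_error(unsigned char err, char *out, size_t outlen)
{
    decode_bits(err, error_bits,
                sizeof error_bits / sizeof error_bits[0], out, outlen);
}
```

With these tables, 0x51 decodes to "DriveReady SeekComplete Error" and 0x04 to "DriveStatusError", i.e. the drive was idle and ready but aborted the command it was given.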

----------------------
2.6.6-mm4      ok
2.6.6-mm5      na
2.6.7-rc1-mm1  errors
2.6.7-rc2      ok
2.6.7-rc2-mm2  errors

CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y

#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_HD_IDE is not set
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
# CONFIG_IDEDISK_STROKE is not set
CONFIG_BLK_DEV_IDECD=m
CONFIG_BLK_DEV_IDETAPE=m
# CONFIG_BLK_DEV_IDEFLOPPY is not set
CONFIG_BLK_DEV_IDESCSI=m
# CONFIG_IDE_TASK_IOCTL is not set
CONFIG_IDE_TASKFILE_IO=y

#
# IDE chipset support/bugfixes
#
CONFIG_IDE_GENERIC=y
# CONFIG_BLK_DEV_CMD640 is not set
CONFIG_BLK_DEV_IDEPNP=y
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
# CONFIG_BLK_DEV_OFFBOARD is not set
# CONFIG_BLK_DEV_GENERIC is not set
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEDMA_PCI=y
# CONFIG_BLK_DEV_IDEDMA_FORCED is not set
CONFIG_IDEDMA_PCI_AUTO=y
# CONFIG_IDEDMA_ONLYDISK is not set
CONFIG_BLK_DEV_ADMA=y
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_BLK_DEV_ATIIXP is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_SC1200 is not set
CONFIG_BLK_DEV_PIIX=y
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_IDE_ARM is not set
# CONFIG_IDE_CHIPSETS is not set
CONFIG_BLK_DEV_IDEDMA=y
# CONFIG_IDEDMA_IVB is not set
CONFIG_IDEDMA_AUTO=y

> I think this is not just a barrier problem (unless barrier is the default).
> One of my two drives gets the error below during operation.
> The drive is the root drive and is mounted with defaults. 2.6.6-mm4
> was the last kernel booted on this box. The 2.6.7-rc1-mm1 kernel was compiled
> with gcc 2.95 with the following fs options:
>
> CONFIG_EXT2_FS=y
> # CONFIG_EXT2_FS_XATTR is not set
> CONFIG_EXT3_FS=m
> # CONFIG_EXT3_FS_XATTR is not set
> CONFIG_JBD=m
> # CONFIG_JBD_DEBUG is not set
> CONFIG_REISERFS_FS=y
> # CONFIG_REISERFS_CHECK is not set
> # CONFIG_REISERFS_PROC_INFO is not set
> # CONFIG_REISERFS_FS_XATTR is not set
> # CONFIG_JFS_FS is not set
> # CONFIG_XFS_FS is not set
> # CONFIG_MINIX_FS is not set
> # CONFIG_ROMFS_FS is not set
> # CONFIG_QUOTA is not set
> # CONFIG_AUTOFS_FS is not set
> CONFIG_AUTOFS4_FS=m

Disk /dev/hda: 6448 MB, 6448619520 bytes
240 heads, 63 sectors/track, 833 cylinders
Units = cylinders of 15120 * 512 = 7741440 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1               1          99      748408+  82  Linux swap
/dev/hda2             100         108       68040   83  Linux
/dev/hda3   *         109         833     5481000   83  Linux
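
The fdisk figures above can be cross-checked against the LBA sector count the drive itself reports. This is a small sketch of that arithmetic in C; the helper names are illustrative only.

```c
#include <stdint.h>

/* Capacity in bytes implied by a cylinder/head/sector geometry. */
uint64_t chs_bytes(uint32_t cylinders, uint32_t heads,
                   uint32_t sectors_per_track, uint32_t sector_size)
{
    return (uint64_t)cylinders * heads * sectors_per_track * sector_size;
}

/* Capacity in bytes implied by an LBA sector count. */
uint64_t lba_bytes(uint64_t sectors, uint32_t sector_size)
{
    return sectors * sector_size;
}
```

Both the fdisk geometry (833 cylinders, 240 heads, 63 sectors, 512-byte sectors) and the reported 12594960 LBA sectors come out at 6448619520 bytes, so the partition table and the drive agree on the disk size.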

> hda reports:
> root@bert:/usr/src/linux# hdparm -iI /dev/hda
>
> /dev/hda:
>
> Model=WDC AC26400R, FwRev=15.01J15, SerialNo=WD-WM6271600165
> Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
> RawCHS=13328/15/63, TrkSize=57600, SectSize=600, ECCbytes=40
> BuffType=DualPortCache, BuffSize=512kB, MaxMultSect=16, MultSect=16
> CurCHS=13328/15/63, CurSects=12594960, LBA=yes, LBAsects=12594960
> IORDY=on/off, tPIO={min:160,w/IORDY:120}, tDMA={min:120,rec:120}
> PIO modes: pio0 pio1 pio2 pio3 pio4
> DMA modes: mdma0 mdma1 mdma2
> UDMA modes: udma0 udma1 *udma2 udma3 udma4
> AdvancedPM=no WriteCache=enabled
> Drive conforms to: device does not report version: 1 2 3 4
>
> * signifies the current active mode
>
>
> ATA device, with non-removable media
> Model Number: WDC AC26400R
> Serial Number: WD-WM6271600165
> Firmware Revision: 15.01J15
> Standards:
> Supported: 4 3 2 1
> Likely used: 4
> Configuration:
>         Logical         max     current
>         cylinders       13328   13328
>         heads           15      15
>         sectors/track   63      63
> --
> bytes/track: 57600 bytes/sector: 600
> CHS current addressable sectors: 12594960
> LBA user addressable sectors: 12594960
> device size with M = 1024*1024: 6149 MBytes
> device size with M = 1000*1000: 6448 MBytes (6 GB)
> Capabilities:
> LBA, IORDY(can be disabled)
> Buffer size: 512.0kB bytes avail on r/w long: 40 Queue depth: 1
> Standby timer values: spec'd by Standard, no device specific minimum
> R/W multiple sector transfer: Max = 16 Current = 16
> DMA: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4
> Cycle time: min=120ns recommended=120ns
> PIO: pio0 pio1 pio2 pio3 pio4
> Cycle time: no flow control=160ns IORDY flow control=120ns
> Commands/features:
> Enabled Supported:
> * READ BUFFER cmd
> * WRITE BUFFER cmd
> * Look-ahead
> * Write cache
> * Power Management feature set
> * SMART feature set
>
> root@bert:/usr/src/linux# hdparm -iI /dev/hdb
>
> /dev/hdb:
>
> Model=Maxtor 6E030L0, FwRev=NAR61590, SerialNo=E178CV5E
> Config={ Fixed }
> RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=57
> BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
> CurCHS=17475/15/63, CurSects=16513875, LBA=yes, LBAsects=60058656
> IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
> PIO modes: pio0 pio1 pio2 pio3 pio4
> DMA modes: mdma0 mdma1 mdma2
> UDMA modes: udma0 udma1 *udma2 udma3 udma4 udma5 udma6
> AdvancedPM=yes: disabled (255) WriteCache=enabled
> Drive conforms to: (null):
>
> * signifies the current active mode
>
>
> ATA device, with non-removable media
> Model Number: Maxtor 6E030L0
> Serial Number: E178CV5E
> Firmware Revision: NAR61590
> Standards:
> Supported: 7 6 5 4
> Likely used: 7
> Configuration:
>         Logical         max     current
>         cylinders       16383   17475
>         heads           16      15
>         sectors/track   63      63
> --
> CHS current addressable sectors: 16513875
> LBA user addressable sectors: 60058656
> device size with M = 1024*1024: 29325 MBytes
> device size with M = 1000*1000: 30750 MBytes (30 GB)
> Capabilities:
> LBA, IORDY(can be disabled)
> Queue depth: 1
> Standby timer values: spec'd by Standard, no device specific minimum
> R/W multiple sector transfer: Max = 16 Current = 16
> Advanced power management level: unknown setting (0x0000)
> Recommended acoustic management value: 192, current value: 254
> DMA: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 udma5 udma6
> Cycle time: min=120ns recommended=120ns
> PIO: pio0 pio1 pio2 pio3 pio4
> Cycle time: no flow control=120ns IORDY flow control=120ns
> Commands/features:
> Enabled Supported:
> * NOP cmd
> * READ BUFFER cmd
> * WRITE BUFFER cmd
> * Host Protected Area feature set
> * Look-ahead
> * Write cache
> * Power Management feature set
> Security Mode feature set
> * SMART feature set
> * FLUSH CACHE EXT command
> * Mandatory FLUSH CACHE command
> * Device Configuration Overlay feature set
> * Automatic Acoustic Management feature set
> SET MAX security extension
> Advanced Power Management feature set
> * DOWNLOAD MICROCODE cmd
> * SMART self-test
> * SMART error logging
> Security:
> Master password revision code = 65534
> supported
> not enabled
> not locked
> not frozen
> not expired: security count
> not supported: enhanced erase
> HW reset results:
> CBLID- above Vih
> Device num = 1 determined by CSEL
> Checksum: correct
>
> hdb is accessed via dm and evms. This is what the boot reports:
>
> May 27 18:17:39 bert kernel: Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> May 27 18:17:39 bert kernel: ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> May 27 18:17:39 bert kernel: PIIX4: IDE controller at PCI slot 0000:00:14.1
> May 27 18:17:39 bert kernel: PIIX4: chipset revision 1
> May 27 18:17:39 bert kernel: PIIX4: not 100%% native mode: will probe irqs later
> May 27 18:17:39 bert kernel: ide0: BM-DMA at 0x10c0-0x10c7, BIOS settings: hda:pio, hdb:DMA
> May 27 18:17:39 bert kernel: ide1: BM-DMA at 0x10c8-0x10cf, BIOS settings: hdc:DMA, hdd:pio
> May 27 18:17:39 bert kernel: hda: WDC AC26400R, ATA DISK drive
> May 27 18:17:39 bert kernel: hdb: Maxtor 6E030L0, ATA DISK drive
> May 27 18:17:39 bert kernel: Using anticipatory io scheduler
> May 27 18:17:39 bert kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> May 27 18:17:39 bert kernel: hdc: HL-DT-ST RW/DVD GCC-4480B, ATAPI CD/DVD-ROM drive
> May 27 18:17:39 bert kernel: ide1 at 0x170-0x177,0x376 on irq 15
> May 27 18:17:39 bert kernel: pnp: the driver 'ide' has been registered
> May 27 18:17:39 bert kernel: hda: max request size: 128KiB
> May 27 18:17:39 bert kernel: hda: 12594960 sectors (6448 MB) w/512KiB Cache, CHS=13328/15/63, UDMA(33)
> May 27 18:17:39 bert kernel: hda: cache flushes supported
> May 27 18:17:39 bert kernel: hda: hda1 hda2 hda3
> May 27 18:17:39 bert kernel: hdb: max request size: 128KiB
> May 27 18:17:39 bert kernel: hdb: 60058656 sectors (30750 MB) w/2048KiB Cache, CHS=59582/16/63, UDMA(33)
> May 27 18:17:39 bert kernel: hdb: cache flushes supported
> May 27 18:17:39 bert kernel: hdb: hdb1 hdb2 hdb3 hdb4 < hdb5 >
>
> followed later by:
>
> May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> May 27 18:18:06 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> May 27 18:18:06 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> May 27 18:19:21 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> May 27 18:19:21 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> May 27 18:19:22 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> May 27 18:19:22 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> May 27 18:20:01 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> May 27 18:20:01 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> May 27 18:20:01 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> May 27 18:20:01 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> May 27 18:21:27 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> May 27 18:21:27 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
>
>
>
>
> Hope this helps,
>
> Ed
>
> On May 27, 2004 04:24 pm, Günther Persoons wrote:
> > Hey,
> > When I mount my reiser partition with the option barrier=flush I get
> > the following message and error:
> > My hard drive is a 2.5 inch Fujitsu 20GB IDE.
> >
> > mount /dev/hda10 /tmp -o barrier=flush
> > mount: wrong fs type, bad option, bad superblock on /dev/hda10,
> > or too many mounted file systems
> > Log:
> > ReiserFS: hda10: found reiserfs format "3.6" with standard journal
> > ReiserFS: hda10: using ordered data mode
> > reiserfs: using flush barriers
> > ReiserFS: hda10: journal params: device hda10, size 8192, journal first
> > block 18, max trans len 1024, max batch 900, max commit age 30, max
> > trans age 30
> > ReiserFS: hda10: checking transaction log (hda10)
> > hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > hda: drive_cmd: error=0x04 { DriveStatusError }
> > hda: barrier support doesn't work
> > ReiserFS: hda10: warning: journal-837: IO error during journal replay
> > ReiserFS: hda10: warning: Replay Failure, unable to mount
> > ReiserFS: hda10: warning: sh-2022: reiserfs_fill_super: unable to
> > initialize journal space
>


2004-06-04 02:32:42

by Andrew Morton

Subject: Re: ide errors in 7-rc1-mm1 and later

Ed Tomlinson wrote:
>
> Hi,
>
> I am still getting these ide errors with 7-rc2-mm2. I get the errors even
> if I mount with barrier=0 (or just defaults). It would seem that something is
> sending my drive commands it does not understand...
>
> May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
>
> How can we find out what is wrong?
>
> This does not seem to be an error that corrupts the fs; it just slows things
> down when it hits a group of these. Note that they keep popping up - they
> do not stop (I still get them hours after booting).

Jens, do we still have the command bytes available when this error hits?

2004-06-04 09:45:23

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Thu, Jun 03 2004, Andrew Morton wrote:
> Ed Tomlinson wrote:
> >
> > Hi,
> >
> > I am still getting these ide errors with 7-rc2-mm2. I get the errors even
> > if I mount with barrier=0 (or just defaults). It would seem that something is
> > sending my drive commands it does not understand...
> >
> > May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> >
> > How can we find out what is wrong?
> >
> > This does not seem to be an error that corrupts the fs; it just slows things
> > down when it hits a group of these. Note that they keep popping up - they
> > do not stop (I still get them hours after booting).
>
> Jens, do we still have the command bytes available when this error hits?

It's not trivial, here's a hack that should dump the offending opcode
though.

--- linux-2.6.7-rc2-mm2/drivers/ide/ide.c~ 2004-06-04 11:32:49.286777112 +0200
+++ linux-2.6.7-rc2-mm2/drivers/ide/ide.c 2004-06-04 11:41:47.338870307 +0200
@@ -438,6 +438,30 @@
 #endif /* FANCY_STATUS_DUMPS */
 		printk("\n");
 	}
+	{
+		struct request *rq;
+		int opcode = 0x100;
+
+		spin_lock(&ide_lock);
+		rq = HWGROUP(drive)->rq;
+		spin_unlock(&ide_lock);
+		if (!rq)
+			goto out;
+		if (rq->flags & (REQ_DRIVE_CMD | REQ_DRIVE_TASK)) {
+			char *args = rq->buffer;
+			if (args)
+				opcode = args[0];
+		} else if (rq->flags & REQ_DRIVE_TASKFILE) {
+			ide_task_t *args = rq->special;
+			if (args) {
+				task_struct_t *tf = (task_struct_t *) args->tfRegister;
+				opcode = tf->command;
+			}
+		}
+
+		printk("ide: failed opcode was %x\n", opcode);
+	}
+out:
 	local_irq_restore(flags);
 	return err;
 }

--
Jens Axboe
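
The opcode lookup in the patch above can be tried out in user space. The following is only a sketch under stated assumptions: struct taskfile_regs is a stand-in that assumes the eight taskfile bytes are laid out with the command register last (as the cast to task_struct_t in the patch implies), and the function names are hypothetical. It reads the bytes as unsigned, because on platforms where plain char is signed an opcode of 0x80 or above read through a char pointer would be sign-extended before being printed with %x.

```c
#include <stdint.h>
#include <string.h>

/* Sentinel for "no opcode found"; real opcodes fit in 8 bits. */
#define OPCODE_UNKNOWN 0x100

/* Assumed layout of the 8-byte taskfile register block (illustrative). */
struct taskfile_regs {
    uint8_t data;
    uint8_t feature;
    uint8_t sector_count;
    uint8_t sector_number;
    uint8_t low_cylinder;
    uint8_t high_cylinder;
    uint8_t device_head;
    uint8_t command;
};

/* Taskfile-style request: the opcode is the last of the eight bytes. */
int opcode_from_taskfile(const uint8_t *regs)
{
    struct taskfile_regs tf;

    if (!regs)
        return OPCODE_UNKNOWN;
    memcpy(&tf, regs, sizeof tf);   /* avoids aliasing through a cast */
    return tf.command;
}

/* drive_cmd-style request: the opcode is the first argument byte. */
int opcode_from_args(const unsigned char *args)
{
    return args ? args[0] : OPCODE_UNKNOWN;
}
```

A result of 0x100 therefore means that no command bytes were attached to the request, which cannot be confused with any 8-bit opcode.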

2004-06-04 11:23:07

by Ed Tomlinson

Subject: Re: ide errors in 7-rc1-mm1 and later

On June 4, 2004 05:42 am, Jens Axboe wrote:
> On Thu, Jun 03 2004, Andrew Morton wrote:
> > Ed Tomlinson wrote:
> > >
> > > Hi,
> > >
> > > I am still getting these ide errors with 7-rc2-mm2. I get the errors even
> > > if I mount with barrier=0 (or just defaults). It would seem that something is
> > > sending my drive commands it does not understand...
> > >
> > > May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > >
> > > How can we find out what is wrong?
> > >
> > > This does not seem to be an error that corrupts the fs; it just slows things
> > > down when it hits a group of these. Note that they keep popping up - they
> > > do not stop (I still get them hours after booting).
> >
> > Jens, do we still have the command bytes available when this error hits?
>
> It's not trivial, here's a hack that should dump the offending opcode
> though.

Hi Jens,

I applied the patch below and booted into the new kernel (the boot message showed the
new compile time). The error messages remained the same - no extra info. Is there
another place that prints this (or (!rq) is true)?

Ideas?
Ed

> --- linux-2.6.7-rc2-mm2/drivers/ide/ide.c~ 2004-06-04 11:32:49.286777112 +0200
> +++ linux-2.6.7-rc2-mm2/drivers/ide/ide.c 2004-06-04 11:41:47.338870307 +0200
> @@ -438,6 +438,30 @@
>  #endif /* FANCY_STATUS_DUMPS */
>  		printk("\n");
>  	}
> +	{
> +		struct request *rq;
> +		int opcode = 0x100;
> +
> +		spin_lock(&ide_lock);
> +		rq = HWGROUP(drive)->rq;
> +		spin_unlock(&ide_lock);
> +		if (!rq)
> +			goto out;
> +		if (rq->flags & (REQ_DRIVE_CMD | REQ_DRIVE_TASK)) {
> +			char *args = rq->buffer;
> +			if (args)
> +				opcode = args[0];
> +		} else if (rq->flags & REQ_DRIVE_TASKFILE) {
> +			ide_task_t *args = rq->special;
> +			if (args) {
> +				task_struct_t *tf = (task_struct_t *) args->tfRegister;
> +				opcode = tf->command;
> +			}
> +		}
> +
> +		printk("ide: failed opcode was %x\n", opcode);
> +	}
> +out:
>  	local_irq_restore(flags);
>  	return err;
>  }
>

2004-06-04 11:32:39

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Fri, Jun 04 2004, Ed Tomlinson wrote:
> On June 4, 2004 05:42 am, Jens Axboe wrote:
> > On Thu, Jun 03 2004, Andrew Morton wrote:
> > > Ed Tomlinson wrote:
> > > >
> > > > Hi,
> > > >
> > > > I am still getting these ide errors with 7-rc2-mm2. I get the errors even
> > > > if I mount with barrier=0 (or just defaults). It would seem that something is
> > > > sending my drive commands it does not understand...
> > > >
> > > > May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > > May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > > >
> > > > How can we find out what is wrong?
> > > >
> > > > This does not seem to be an error that corrupts the fs; it just slows things
> > > > down when it hits a group of these. Note that they keep popping up - they
> > > > do not stop (I still get them hours after booting).
> > >
> > > Jens, do we still have the command bytes available when this error hits?
> >
> > It's not trivial, here's a hack that should dump the offending opcode
> > though.
>
> Hi Jens,
>
> I applied the patch below and booted into the new kernel (the boot
> message showed the new compile time). The error messages remained the
> same - no extra info. Is there another place that prints this (or
> (!rq) is true)?

!rq should not be true, strange... are you sure it just doesn't go to
/var/log/messages? It should be there in dmesg. Alternatively, add a
KERN_ERR to that printk.

--
Jens Axboe
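
To illustrate the KERN_ERR suggestion above: in 2.6-era kernels the log level is a short string literal that the compiler concatenates onto the front of the format string, and a message without one is logged at the default level. The sketch below mimics that in user space. The KERN_ERR definition here is a stand-in for the kernel macro, snprintf stands in for printk, and format_opcode_msg is a made-up helper.

```c
#include <stdio.h>

/* Stand-in for the kernel macro; in 2.6 it expands to the prefix "<3>". */
#define KERN_ERR "<3>"

/* Build the message the debug patch would emit with an explicit level. */
int format_opcode_msg(char *out, size_t outlen, int opcode)
{
    /* Adjacent string literals are concatenated at compile time. */
    return snprintf(out, outlen,
                    KERN_ERR "ide: failed opcode was %x\n", opcode);
}
```

In the kernel itself the change would simply be to write the printk call as printk(KERN_ERR "ide: failed opcode was %x\n", opcode), so that the message is not dropped by a syslog configuration that filters lower-priority messages.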

Subject: Re: ide errors in 7-rc1-mm1 and later

On Friday 04 of June 2004 11:42, Jens Axboe wrote:
> On Thu, Jun 03 2004, Andrew Morton wrote:
> > Ed Tomlinson wrote:
> > > Hi,
> > >
> > > I am still getting these ide errors with 7-rc2-mm2. I get the errors
> > > even if I mount with barrier=0 (or just defaults). It would seem that
> > > something is sending my drive commands it does not understand...
> > >
> > > May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > >
> > > How can we find out what is wrong?
> > >
> > > This does not seem to be an error that corrupts the fs; it just slows
> > > things down when it hits a group of these. Note that they keep popping
> > > up - they do not stop (I still get them hours after booting).
> >
> > Jens, do we still have the command bytes available when this error hits?
>
> It's not trivial, here's a hack that should dump the offending opcode
> though.

I bet it is WIN_FLUSH_CACHE_EXT.
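
If the guess above is right, the question becomes which flush opcode a given drive actually accepts. As a hedged sketch (user-space C, not the driver's real logic), one way to choose is from IDENTIFY DEVICE word 83, where bit 12 advertises FLUSH CACHE and bit 13 advertises FLUSH CACHE EXT, and the word is only valid when bit 14 is set and bit 15 is clear. The opcode values 0xE7 and 0xEA are the standard ATA ones; pick_flush_opcode is a hypothetical name.

```c
#include <stdint.h>

#define WIN_FLUSH_CACHE     0xE7   /* ATA FLUSH CACHE */
#define WIN_FLUSH_CACHE_EXT 0xEA   /* ATA FLUSH CACHE EXT (48-bit) */

/*
 * Choose a flush opcode from a 256-word IDENTIFY DEVICE block.
 * Returns 0 when the drive does not advertise either command.
 */
int pick_flush_opcode(const uint16_t *id)
{
    uint16_t w83 = id[83];

    /* Word 83 is valid only if bit 14 is set and bit 15 is clear. */
    if ((w83 & 0xC000) != 0x4000)
        return 0;
    if (w83 & (1u << 13))
        return WIN_FLUSH_CACHE_EXT;
    if (w83 & (1u << 12))
        return WIN_FLUSH_CACHE;
    return 0;
}
```

A drive that reports neither bit would get no flush command at all under this scheme, which is the conservative choice for an older ATA-4 disk such as hda here.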

2004-06-04 11:47:11

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Fri, Jun 04 2004, Jens Axboe wrote:
> On Fri, Jun 04 2004, Ed Tomlinson wrote:
> > On June 4, 2004 05:42 am, Jens Axboe wrote:
> > > On Thu, Jun 03 2004, Andrew Morton wrote:
> > > > Ed Tomlinson wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > I am still getting these ide errors with 7-rc2-mm2. I get the errors even
> > > > > if I mount with barrier=0 (or just defaults). It would seem that something is
> > > > > sending my drive commands it does not understand...
> > > > >
> > > > > May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > > > May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > > > >
> > > > > How can we find out what is wrong?
> > > > >
> > > > > This does not seem to be an error that corrupts the fs; it just slows things
> > > > > down when it hits a group of these. Note that they keep popping up - they
> > > > > do not stop (I still get them hours after booting).
> > > >
> > > > Jens, do we still have the command bytes available when this error hits?
> > >
> > > It's not trivial, here's a hack that should dump the offending opcode
> > > though.
> >
> > Hi Jens,
> >
> > I applied the patch below and booted into the new kernel (the boot
> > message showed the new compile time). The error messages remained the
> > same - no extra info. Is there another place that prints this (or
> > (!rq) is true)?
>
> !rq should not be true, strange... are you sure it just doesn't go to
> /var/log/messages? It should be there in dmesg. Alternatively, add a
> KERN_ERR to that printk.

Sorry my bad, ide-disk has a private dump_status() of course. Let me
provide a new debug and possible fix, hang on.

--
Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Friday 04 of June 2004 13:32, Jens Axboe wrote:
> On Fri, Jun 04 2004, Ed Tomlinson wrote:
> > On June 4, 2004 05:42 am, Jens Axboe wrote:
> > > On Thu, Jun 03 2004, Andrew Morton wrote:
> > > > Ed Tomlinson wrote:
> > > > > Hi,
> > > > >
> > > > > I am still getting these ide errors with 7-rc2-mm2. I get the
> > > > > errors even if I mount with barrier=0 (or just defaults). It would
> > > > > seem that something is sending my drive commands it does not
> > > > > understand...
> > > > >
> > > > > May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > > > May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > > > >
> > > > > How can we find out what is wrong?
> > > > >
> > > > > This does not seem to be an error that corrupts the fs; it just
> > > > > slows things down when it hits a group of these. Note that they
> > > > > keep popping up - they do not stop (I still get them hours after
> > > > > booting).
> > > >
> > > > Jens, do we still have the command bytes available when this error
> > > > hits?
> > >
> > > It's not trivial, here's a hack that should dump the offending opcode
> > > though.
> >
> > Hi Jens,
> >
> > I applied the patch below and booted into the new kernel (the boot
> > message showed the new compile time). The error messages remained the
> > same - no extra info. Is there another place that prints this (or
> > (!rq) is true)?
>
> !rq should not be true, strange... are you sure it just doesn't go to
> /var/log/messages? It should be there in dmesg. Alternatively, add a
> KERN_ERR to that printk.

Probably !rq is true.

Hint: this is what you get for playing tricks with hwgroup->wrq
(do you now understand why it is evil?).

Cheers,
Bartlomiej

2004-06-04 12:03:00

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Fri, Jun 04 2004, Bartlomiej Zolnierkiewicz wrote:
> On Friday 04 of June 2004 13:32, Jens Axboe wrote:
> > On Fri, Jun 04 2004, Ed Tomlinson wrote:
> > > On June 4, 2004 05:42 am, Jens Axboe wrote:
> > > > On Thu, Jun 03 2004, Andrew Morton wrote:
> > > > > Ed Tomlinson wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I am still getting these ide errors with 7-rc2-mm2. I get the
> > > > > > errors even if I mount with barrier=0 (or just defaults). It would
> > > > > > seem that something is sending my drive commands it does not
> > > > > > understand...
> > > > > >
> > > > > > May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > > > > May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > > > > >
> > > > > > How can we find out what is wrong?
> > > > > >
> > > > > > This does not seem to be an error that corrupts the fs; it just
> > > > > > slows things down when it hits a group of these. Note that they
> > > > > > keep popping up - they do not stop (I still get them hours after
> > > > > > booting).
> > > > >
> > > > > Jens, do we still have the command bytes available when this error
> > > > > hits?
> > > >
> > > > It's not trivial, here's a hack that should dump the offending opcode
> > > > though.
> > >
> > > Hi Jens,
> > >
> > > I applied the patch below and booted into the new kernel (the boot
> > > message showed the new compile time). The error messages remained the
> > > same - no extra info. Is there another place that prints this (or
> > > (!rq) is true)?
> >
> > !rq should not be true, strange... are you sure it just doesn't go to
> > /var/log/messages? It should be there in dmesg. Alternatively, add a
> > KERN_ERR to that printk.
>
> Probably !rq is true.

Don't think so, see my other mail. It's the crap duplicate dump_status()
functions.

> Hint: this is what you get for playing tricks with hwgroup->wrq
> (do you now understand why it is evil?).

Oh give it up, the code is complete crap in so many places as it is.
->rq would just point to &wrq instead. How hard it is just to provide an
opcode dump (and that you missed it too) is a testament to how convoluted
it is. And there are almost a handful of different command types, with as
many ways to submit them.

This is not blaming you btw, just the code. You've done some nice
cleanups. The main culprit is no longer active :)

--
Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Friday 04 of June 2004 14:01, Jens Axboe wrote:
> On Fri, Jun 04 2004, Bartlomiej Zolnierkiewicz wrote:
> > On Friday 04 of June 2004 13:32, Jens Axboe wrote:
> > > On Fri, Jun 04 2004, Ed Tomlinson wrote:
> > > > On June 4, 2004 05:42 am, Jens Axboe wrote:
> > > > > On Thu, Jun 03 2004, Andrew Morton wrote:
> > > > > > Ed Tomlinson wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am still getting these ide errors with 7-rc2-mm2. I get the
> > > > > > > errors even if I mount with barrier=0 (or just defaults). It
> > > > > > > would seem that something is sending my drive commands it does
> > > > > > > not understand...
> > > > > > >
> > > > > > > May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > > > > > May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > > > > > >
> > > > > > > How can we find out what is wrong?
> > > > > > >
> > > > > > > This does not seem to be an error that corrupts the fs; it just
> > > > > > > slows things down when it hits a group of these. Note that
> > > > > > > they keep popping up - they do not stop (I still get them hours
> > > > > > > after booting).
> > > > > >
> > > > > > Jens, do we still have the command bytes available when this
> > > > > > error hits?
> > > > >
> > > > > It's not trivial, here's a hack that should dump the offending
> > > > > opcode though.
> > > >
> > > > Hi Jens,
> > > >
> > > > I applied the patch below and booted into the new kernel (the boot
> > > > message showed the new compile time). The error messages remained
> > > > the same - no extra info. Is there another place that prints this
> > > > (or (!rq) is true)?
> > >
> > > !rq should not be true, strange... are you sure it just doesn't go to
> > > /var/log/messages? It should be there in dmesg. Alternatively, add a
> > > KERN_ERR to that printk.
> >
> > Probably !rq is true.
>
> Don't think so, see my other mail. It's the crap duplicate dump_status()
> functions.
>
> > Hint: this is what you get for playing tricks with hwgroup->wrq
> > (do you now understand why it is evil?).
>
> Oh give it up, the code is complete crap in so many places as it is.
> ->rq would just point to &wrq instead. Testament to how hard it is just

Yep, you are right, sorry.

> to provide a dump opcode (and that you missed it too) is how convoluted
> it is. And almost a handful of different command types, with as many
> ways to submit them.
>
> This is not blaming you btw, just the code. You've done some nice
> cleanups. The main culprit is no longer active :)

Well, thanks but I still think that your patch suits crappy code perfectly
(you know all the complaints).

Cheers,
Bartlomiej

2004-06-04 12:47:37

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Fri, Jun 04 2004, Bartlomiej Zolnierkiewicz wrote:
> Well, thanks but I still think that your patch suits crappy code perfectly
> (you know all the complaints).

I'm not on a crusade to clean up drivers/ide, in fact I could not care
less if it rots away (thankfully it is doing just that, pata is going
away). Most of your complaints are not valid in my opinion (->wrq usage
is fine; it's not pretty, but it's not broken as long as you serialize
access across the hwgroup of course). Like the rest, it's an artifact of
how messy the code paths are in there. That could be cleaned too
naturally, but that's someone else's job and I'm not about to increase my
work load in that area.

That you need to queue pre/post flushes to support barriers is a _driver
implementation detail_ in my opinion. You don't want to even advertise
that to upper layers. I will move a little of that into the block layer,
if only because SATA will need it as well.

As for REQ_DRIVE_TASK and ide_get_error_location(), well hell I do take
patches! If there's something you consider broken, damnit send a patch
to correct it and I'll surely merge it into the base if I agree it makes
sense. That's the way to get changes done if you feel something should
be different, snide remarks with basically zero detail is not.

--
Jens Axboe
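
The pre/post flush idea described above can be pictured as expanding one barrier write into a short command sequence. The following is a toy sketch in user-space C and is not how the kernel implements it; enum disk_op and expand_request are invented for illustration.

```c
#include <stddef.h>

/* Commands a driver might issue to the disk (illustrative only). */
enum disk_op {
    OP_FLUSH,   /* flush the drive's write cache */
    OP_WRITE    /* the data write itself */
};

/*
 * Expand one write request into the commands actually sent.
 * A barrier write is bracketed by cache flushes so that everything
 * queued before it is on the platter first, and the barrier itself
 * is on the platter before anything after it.
 * Returns the number of ops stored in out (at most cap).
 */
size_t expand_request(int is_barrier, enum disk_op *out, size_t cap)
{
    size_t n = 0;

    if (is_barrier && n < cap)
        out[n++] = OP_FLUSH;    /* pre-flush */
    if (n < cap)
        out[n++] = OP_WRITE;
    if (is_barrier && n < cap)
        out[n++] = OP_FLUSH;    /* post-flush */
    return n;
}
```

An ordinary write expands to a single OP_WRITE, while a barrier write becomes flush, write, flush. Because the upper layers only ever see the one request, the flushes stay a driver-level detail, which is the point made in the message above.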

Subject: Re: ide errors in 7-rc1-mm1 and later

On Friday 04 of June 2004 14:47, Jens Axboe wrote:
> On Fri, Jun 04 2004, Bartlomiej Zolnierkiewicz wrote:
> > Well, thanks but I still think that your patch suits crappy code
> > perfectly (you know all the complaints).
>
> I'm not on a crusade to clean up drivers/ide, in fact I could not care
> less if it rots away (thankfully it is doing just that, pata is going
> away). Most of your complaints are not valid in my opinion (->wrq usage

You are missing two facts:
- I'm on the _crusade_ to clean drivers/ide and merge them with libata later
- pata is (slowly) going away but support for it is not going _anywhere_
(although some people are smoking 'libata pata' crack)

> is fine; it's not pretty, but it's not broken as long as you serialize
> access across the hwgroup of course). Like the rest, it's an artifact of
> how messy the code paths are in there. That could be cleaned too
> naturally, but that's someone else's job and I'm not about to increase my
> work load in that area.

Yep, you prefer to increase my work load instead.

> That you need to queue pre/post flushes to support barriers is a _driver
> implementation detail_ in my opinion. You don't want to even advertise

It is implementation braindamage IMO (but I'll buy it if rest is OK).

> that to upper layers. I will move a little of that into the block layer,
> if only because SATA will need it as well.
>
> As for REQ_DRIVE_TASK and ide_get_error_location(), well hell I do take
> patches! If there's something you consider broken, damnit send a patch

It is _your_ job to do it properly.

There are no double standards, 'IDE crap embargo' holds for everyone.

> to correct it and I'll surely merge it into the base if I agree it makes
> sense. That's the way to get changes done if you feel something should
> be different, snide remarks with basically zero detail is not.

I think I provided enough details a few times already.
You can always ask in case of problems (keep linux-ide@ cc:-ed).

[ First thing to do is to use REQ_DRIVE_TASKFILE not REQ_DRIVE_TASK. ]

Cheers,
Bartlomiej

2004-06-04 15:23:56

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Fri, Jun 04 2004, Bartlomiej Zolnierkiewicz wrote:
> On Friday 04 of June 2004 14:47, Jens Axboe wrote:
> > On Fri, Jun 04 2004, Bartlomiej Zolnierkiewicz wrote:
> > > Well, thanks but I still think that your patch suits crappy code
> > > perfectly (you know all the complains).
> >
> > I'm not on a crusade to clean up drivers/ide, in fact I could not care
> > less it if rots away (thank fully it is doing just that, pata is going
> > away). Most of your complaints are not valid in my opinion (->wrq usage
>
> You are missing two facts:
> - I'm on the _crusade_ to clean drivers/ide and merge them with libata later

I'm well aware of that.

> - pata is (slowly) going away but support for it is not going _anywhere_
> (although some people are smoking 'libata pata' crack)


> > is fine. it's not pretty, but it's not broken as long as you serialize
> > access across the hwgroup of course). Like the rest, it's an artifact of
> > how messy the code paths are in there. That could be cleaned too
> > naturally, but that's someone elses job and I'm not about to increase my
> > work load in that area.
>
> Yep, you prefer to increase my work load instead.

If you think that any change to the ide base is increasing your work
load, then yes. Otherwise no.

> > That you need to queue pre/post flushes to support barriers is a _driver
> > implementation detail_ in my opinion. You don't want to even advertise
>
> It is implementation braindamage IMO (but I'll buy it if rest is OK).

Well feel free to pull a rabbit out of your hat and suggest something
else that works for barriers. It's mind boggling that nothing so far has
come out of t13 to address this, I guess data integrity isn't high on
their list.

So in short, either shut up or put up.

> > that to upper layers. I will move a little of that into the block layer,
> > if only because SATA will need it as well.
> >
> > As for REQ_DRIVE_TASK and ide_get_error_location(), well hell I do take
> > patches! If there's something you consider broken, damnit send a patch
>
> It is _your_ job to do it properly.

I _am_ doing it properly. If you think otherwise, then I suggest you
show in code what you want changed. If you think it's my job to keep
changing the code based on unclear suggestions, then you are sadly
mistaken.

> There are no double standards, 'IDE crap embargo' holds for everyone.

Like it or not, but ide code needs changing to support barriers one way
or the other. If there's some part of the implementation you don't like,
then I suggest you show why. Since we appear to have reached a
discussion dead lock, I suggest you do so by showing a patch changing eg
the ide_get_error_location() stuff. Sadly you could have done this
roughly 10 times in the same time frame that you have written these
emails.

> > to correct it and I'll surely merge it into the base if I agree it makes
> > sense. That's the way to get changes done if you feel something should
> > be different, snide remarks with basically zero detail is not.
>
> I think I provided enough details few times already.
> You can always ask in case of problems (keep linux-ide@ cc:-ed).
>
> [ First thing to do is to use REQ_DRIVE_TASKFILE not REQ_DRIVE_TASK. ]

REQ_DRIVE_TASKFILE change I agree with, and yeah you have given enough
detail there. And I'll work ide_get_error_location() to fill the holes
in case of flush errors. I'll get that change done soonish and post
updates for -mm.

--
Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Friday 04 of June 2004 17:23, Jens Axboe wrote:
> On Fri, Jun 04 2004, Bartlomiej Zolnierkiewicz wrote:
> > On Friday 04 of June 2004 14:47, Jens Axboe wrote:
> > > On Fri, Jun 04 2004, Bartlomiej Zolnierkiewicz wrote:
> > > > Well, thanks but I still think that your patch suits crappy code
> > > > perfectly (you know all the complains).
> > >
> > > I'm not on a crusade to clean up drivers/ide, in fact I could not care
> > > less it if rots away (thank fully it is doing just that, pata is going
> > > away). Most of your complaints are not valid in my opinion (->wrq usage
> >
> > You are missing two facts:
> > - I'm on the _crusade_ to clean drivers/ide and merge them with libata
> > later
>
> I'm well aware of that.
>
> > - pata is (slowly) going away but support for it is not going _anywhere_
> > (although some people are smoking 'libata pata' crack)
> >
> > > is fine. it's not pretty, but it's not broken as long as you serialize
> > > access across the hwgroup of course). Like the rest, it's an artifact
> > > of how messy the code paths are in there. That could be cleaned too
> > > naturally, but that's someone elses job and I'm not about to increase
> > > my work load in that area.
> >
> > Yep, you prefer to increase my work load instead.
>
> If you think that any change to the ide base is increasing your work
> load, then yes. Otherwise no.

No, only the messy ones.

> > > That you need to queue pre/post flushes to support barriers is a
> > > _driver implementation detail_ in my opinion. You don't want to even
> > > advertise
> >
> > It is implementation braindamage IMO (but I'll buy it if rest is OK).
>
> Well feel free to pull a rabbit out of your hat and suggest something
> else that works for barriers. It's mind boggling that nothing so far has
> come out of t13 to address this, I guess data integrity isn't high on
> their list.
>
> So in short, either shut up or put up.

Yeah, this is the hardest part. I'll see what can be done.

> > > that to upper layers. I will move a little of that into the block
> > > layer, if only because SATA will need it as well.
> > >
> > > As for REQ_DRIVE_TASK and ide_get_error_location(), well hell I do take
> > > patches! If there's something you consider broken, damnit send a patch
> >
> > It is _your_ job to do it properly.
>
> I _am_ doing it properly. If you think otherwise, then I suggest you
> show in code what you want changed. If you think it's my job to keep
> changing the code based on unclear suggestions, then you are sadly
> mistaken.

Suggestions were clear, you've chosen to ignore them wishing that
I will correct the patch or that you will push patch upstream anyway.

> > There are no double standards, 'IDE crap embargo' holds for everyone.
>
> Like it or not, but ide code needs changing to support barriers one way

Rule is simple "no more crappola in IDE" and I don't care what your
patch does if this rule is violated.

> or the other. If there's some part of the implementation you don't like,
> then I suggest you show why. Since we appear to have reached a

Damn, I showed it few times. You seem to contradict yourself.

> discussion dead lock, I suggest you do so by showing a patch changing eg
> the ide_get_error_location() stuff. Sadly you could have done this
> roughly 10 times in the same time frame that you have written these
> emails.

Are you trying to trick me into doing your task?

> > > to correct it and I'll surely merge it into the base if I agree it
> > > makes sense. That's the way to get changes done if you feel something
> > > should be different, snide remarks with basically zero detail is not.
> >
> > I think I provided enough details few times already.
> > You can always ask in case of problems (keep linux-ide@ cc:-ed).
> >
> > [ First thing to do is to use REQ_DRIVE_TASKFILE not REQ_DRIVE_TASK. ]
>
> REQ_DRIVE_TASKFILE change I agree with, and yeah you have given enough
> detail there. And I'll work ide_get_error_location() to fill the holes
> in case of flush errors. I'll get that change done soonish and post
> updates for -mm.

2004-06-04 17:32:40

by Jeff Garzik

Subject: Re: ide errors in 7-rc1-mm1 and later

Jens Axboe wrote:
> else that works for barriers. It's mind boggling that nothing so far has
> come out of t13 to address this, I guess data integrity isn't high on
> their list.


Chuckle :)

Personally I look at it the other way around -- why hasn't anybody on
the OS side written up a proposal that satisfies 100% of the OS barrier
needs?

We've got the device manufacturer contacts these days to get serious
attention paid, IMO. Just need the proposal now.

Just like Linux, ATA evolves in the direction that people speak up
about... I'll leave it to the audience to decide if Windows and data
integrity go hand-in-hand <grin>

Jeff


2004-06-05 09:19:17

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Fri, Jun 04 2004, Bartlomiej Zolnierkiewicz wrote:
> > > Yep, you prefer to increase my work load instead.
> >
> > If you think that any change to the ide base is increasing your work
> > load, then yes. Otherwise no.
>
> No, only the messy ones.
>
> > > > That you need to queue pre/post flushes to support barriers is a
> > > > _driver implementation detail_ in my opinion. You don't want to even
> > > > advertise
> > >
> > > It is implementation braindamage IMO (but I'll buy it if rest is OK).
> >
> > Well feel free to pull a rabbit out of your hat and suggest something
> > else that works for barriers. It's mind boggling that nothing so far has
> > come out of t13 to address this, I guess data integrity isn't high on
> > their list.
> >
> > So in short, either shut up or put up.
>
> Yeah, this is the hardest part. I'll see what can be done.
>
> > > > that to upper layers. I will move a little of that into the block
> > > > layer, if only because SATA will need it as well.
> > > >
> > > > As for REQ_DRIVE_TASK and ide_get_error_location(), well hell I do take
> > > > patches! If there's something you consider broken, damnit send a patch
> > >
> > > It is _your_ job to do it properly.
> >
> > I _am_ doing it properly. If you think otherwise, then I suggest you
> > show in code what you want changed. If you think it's my job to keep
> > changing the code based on unclear suggestions, then you are sadly
> > mistaken.
>
> Suggestions were clear, you've chosen to ignore them wishing that
> I will correct the patch or that you will push patch upstream anyway.

And you seem to think that an IDE maintainers listing provides you with
a magical wand that says what goes and doesn't. You might want to check
if that hat fits too tightly. Generally, I'd like folks to help out.
And generally, I like people to provide code comments that are to the
point - or, even better, show what they mean with a patch. If you have
no technical arguments except saying 'crap', then don't expect me to put
much value into your comments.

> > > There are no double standards, 'IDE crap embargo' holds for everyone.
> >
> > Like it or not, but ide code needs changing to support barriers one way
>
> Rule is simple "no more crappola in IDE" and I don't care what your
> patch does if this rule is violated.

I'm really sick of having this debate, it's a complete waste of time.
I'm not looking for your approval or anything in that order, and since
we don't agree on all the points in solving this problem, there's no point
in continuing.

> > or the other. If there's some part of the implementation you don't like,
> > then I suggest you show why. Since we appear to have reached a
>
> Damn, I showed it few times. You seem to contradict yourself.

A few of the points. Your main argument on the pre/post flush business
makes zero sense still, and that seems to be the heart of your
'crappola' argument.

I already said that I can move the business of queueing post/pre flushes
into the block core instead. You seem to object to the very way of using pre/post
flushes to provide barriers, and to that I can only say tough shit.
Unless you can pull a rabbit out of your hat and suggest something
better, then your 'crappola' argument holds absolutely no grounds
whatsoever. The pre/post flush approach has worked successfully, it's
been tested extensively, and it works. Your pipe dreams of absolutely no
substance need no further comments.

> > discussion dead lock, I suggest you do so by showing a patch changing eg
> > the ide_get_error_location() stuff. Sadly you could have done this
> > roughly 10 times in the same time frame that you have written these
> > emails.
>
> Are you trying to trick me into doing your task?

I don't know why you keep thinking this is my job to complete this
project 100% on my own?! There's a general problem that needs solving,
and I would hope that others would be willing to help out where needed.
I would encourage people to help out if they care about the issue.

I'm not going to comment further on your mails in this thread, unless
they have substantial technical comment. Your 'crap' arguments so far
have been largely unsubstantiated, and as such they don't accomplish
much except waste time.

--
Jens Axboe

2004-06-05 09:25:21

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Fri, Jun 04 2004, Jeff Garzik wrote:
> Jens Axboe wrote:
> >else that works for barriers. It's mind boggling that nothing so far has
> >come out of t13 to address this, I guess data integrity isn't high on
> >their list.
>
>
> Chuckle :)
>
> Personally I look at it the other way around -- why hasn't anybody on
> the OS side written up a proposal that satisfies 100% of the OS barrier
> needs?
>
> We've got the device manufacturer contacts these days to get serious
> attention paid, IMO. Just need the proposal now.
>
> Just like Linux, ATA evolves in the direction that people speak up
> about... I'll leave it to the audience to decide if Windows and data
> integrity go hand-in-hand <grin>

I did suggest this a few years ago. The comment I received was that
they didn't take suggestions from OS people, if I didn't have a drive
implementation to go with the proposal they couldn't use it for
anything. Which was interesting, since that seemed to suggest that t13
had little steering in ata development, they mainly put into the ATA
specs what drive manufacturers shoved at them. Of course this isn't 100%
true, but it does explain a lot of things :-)

Andre even tried getting FUA to do what we needed, no such luck there.
Some other bigger OS wanted it differently, the rest is history.

There's nothing I would love more than being able to kill the pre and
post flushes and use something more effective. So if we can write up a
proposal that has some chance of being debated, I'm all for it.

--
Jens Axboe

2004-06-06 16:16:17

by Eric D. Mudama

Subject: Re: ide errors in 7-rc1-mm1 and later

On Sat, Jun 5 at 11:24, Jens Axboe wrote:
>I did suggest this a few years ago. The comment I received was that
>they didn't take suggestions from OS people, if I didn't have a drive
>implementation to go with the proposal they couldn't use it for
>anything. Which was interesting, since that seemed to suggest that t13
>had little steering in ata development, they mainly put into the ATA
>specs what drive manufacturers shoved at them. Of course this isn't 100%
>true, but it does explain a lot of things :-)

If it helps, I'm listening.

Suggestions/proposals for new features etc, if they're a good idea, I
can help push inside via our SATA/T13 reps. Note that as per all
long-lived specs with multiple revisions, changing the behavior of an
existing feature to something incompatible is virtually never
feasible.

>Andre even tried getting FUA to do what we needed, no such luck there.
>Some other bigger OS wanted it differently, the rest is history.

Lo siento, I wasn't around when that occurred. Of course, that other
bigger OS has a very large installed base, and selling a drive that
breaks it is corporate suicide.


--
Eric D. Mudama

2004-06-06 20:46:42

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Sun, Jun 06 2004, Eric D. Mudama wrote:
> On Sat, Jun 5 at 11:24, Jens Axboe wrote:
> >I did suggest this a few years ago. The comment I received was that
> >they didn't take suggestions from OS people, if I didn't have a drive
> >implementation to go with the proposal they couldn't use it for
> >anything. Which was interesting, since that seemed to suggest that t13
> >had little steering in ata development, they mainly put into the ATA
> >specs what drive manufacturers shoved at them. Of course this isn't 100%
> >true, but it does explain a lot of things :-)
>
> If it helps, I'm listening.
>
> Suggestions/proposals for new features etc, if they're a good idea, I
> can help push inside via our SATA/T13 reps. Note that as per all
> long-lived specs with multiple revisions, changing the behavior of an
> existing feature to something incompatible is virtually never
> feasible.

Of course not, you cannot change the way the command works now. This was
at the time when the proposal was being added, however.

There is still the feature register, which is reserved for write fua,
that could be used. For some odd reason t13 prefers to add a separate
opcode for the identical command, instead of just using option bits. But
you could just flag an ordered bit for WRITE_DMA_EXT_FUA, that would
work wonders.
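
Purely as an illustration of the idea above, and not a spec proposal: the
sketch below builds a toy taskfile for WRITE DMA FUA EXT and sets an
"ordered" flag in the feature field. The 0x3D opcode is the real one; the
bit value HYPOTHETICAL_FEAT_ORDERED, the struct layout and the function
name are invented here and appear in no ATA document.

```c
#include <assert.h>
#include <stdint.h>

#define ATA_CMD_WRITE_DMA_FUA_EXT 0x3D  /* opcode as defined by ATA */
#define HYPOTHETICAL_FEAT_ORDERED 0x01  /* NOT in any spec, illustration only */

/* Toy taskfile: just the fields this sketch needs. */
struct toy_taskfile {
	uint8_t  feature;
	uint8_t  command;
	uint16_t nsect;
	uint64_t lba;
};

/* Build a FUA write; 'ordered' asks for the hypothetical ordering bit. */
static struct toy_taskfile build_fua_write(uint64_t lba, uint16_t nsect,
					   int ordered)
{
	struct toy_taskfile tf = {0};

	assert(lba < (1ULL << 48));  /* 48-bit LBA limit of the EXT commands */
	tf.command = ATA_CMD_WRITE_DMA_FUA_EXT;
	tf.feature = ordered ? HYPOTHETICAL_FEAT_ORDERED : 0;
	tf.nsect = nsect;
	tf.lba = lba;
	return tf;
}
```

The point is only that the same opcode could carry the extra semantics in
an option bit, so no new command would be needed.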

> >Andre even tried getting FUA to do what we needed, no such luck there.
> >Some other bigger OS wanted it differently, the rest is history.
>
> Lo siento, I wasn't around when that occurred. Of course, that other
> bigger OS has a very large installed base, and selling a drive that
> breaks it is corporate suicide.

I don't think anyone in their right mind would expect that. Of course in
10 years time we can all laugh at this when the tables have turned :-)

--
Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later


[ end of flaming + new technical arguments, please read ]

On Saturday 05 of June 2004 11:18, Jens Axboe wrote:
> On Fri, Jun 04 2004, Bartlomiej Zolnierkiewicz wrote:
> > > > Yep, you prefer to increase my work load instead.
> > >
> > > If you think that any change to the ide base is increasing your work
> > > load, then yes. Otherwise no.
> >
> > No, only the messy ones.
> >
> > > > > That you need to queue pre/post flushes to support barriers is a
> > > > > _driver implementation detail_ in my opinion. You don't want to
> > > > > even advertise
> > > >
> > > > It is implementation braindamage IMO (but I'll buy it if rest is OK).
> > >
> > > Well feel free to pull a rabbit out of your hat and suggest something
> > > else that works for barriers. It's mind boggling that nothing so far
> > > has come out of t13 to address this, I guess data integrity isn't high
> > > on their list.
> > >
> > > So in short, either shut up or put up.
> >
> > Yeah, this is the hardest part. I'll see what can be done.
> >
> > > > > that to upper layers. I will move a little of that into the block
> > > > > layer, if only because SATA will need it as well.
> > > > >
> > > > > As for REQ_DRIVE_TASK and ide_get_error_location(), well hell I do
> > > > > take patches! If there's something you consider broken, damnit send
> > > > > a patch
> > > >
> > > > It is _your_ job to do it properly.
> > >
> > > I _am_ doing it properly. If you think otherwise, then I suggest you
> > > show in code what you want changed. If you think it's my job to keep
> > > changing the code based on unclear suggestions, then you are sadly
> > > mistaken.
> >
> > Suggestions were clear, you've chosen to ignore them wishing that
> > I will correct the patch or that you will push patch upstream anyway.
>
> And you seem to think that an IDE maintainers listing provides you with
> a magical wand that says what goes and doesn't. You might want to check
> if that hat fits too tightly. Generally, I'd like folks to help out.

Sure, you don't need my ACK, that's obvious - you need it from Linus/Andrew.

> And generally, I like people to provide code comments that are to the
> point - or, even better, show what they mean with a patch. If you have
> no technical arguments except saying 'crap', then don't expect me to put
> much value into your comments.

I don't think this is true but I'll try to be more 'technical' from now on.

> > > > There are no double standards, 'IDE crap embargo' holds for everyone.
> > >
> > > Like it or not, but ide code needs changing to support barriers one way
> >
> > Rule is simple "no more crappola in IDE" and I don't care what your
> > patch does if this rule is violated.
>
> I'm really sick of having this debate, it's a complete waste of time.
> I'm not looking for your approval or anything in that order, and since

I hope that people doing block layer changes won't get the same attitude.

> we don't agree on all the points in solving this problem, there's no point
> in continuing.

I tried to redo IDE part but discovered nasty design problem, more below.

> > > or the other. If there's some part of the implementation you don't
> > > like, then I suggest you show why. Since we appear to have reached a
> >
> > Damn, I showed it few times. You seem to contradict yourself.
>
> A few of the points. Your main argument on the pre/post flush business
> makes zero sense still, and that seems to be the heart of your
> 'crappola' argument.
>
> I already said that I can move the business of queueing post/pre flushes
> into the block core instead. You seem to object to the very way of using pre/post
> flushes to provide barriers, and to that I can only say tough shit.
> Unless you can pull a rabbit out of your hat and suggest something
> better, then your 'crappola' argument holds absolutely no grounds
> whatsoever. The pre/post flush approach has worked successfully, it's
> been tested extensively, and it works. Your pipe dreams of absolutely no
> substance need no further comments.

It currently works this way:

pre flush (whole disk cache) + write + post flush (whole disk cache)

This is private to IDE code, higher layers do not know about it.

write + flush (whole disk cache)

Is sufficient because you can get the failed sector number and see if it belongs
to your write request. Pre flush can't help you in any way with previous
requests because they were already ACK-ed to higher layers.

Please correct me if I'm missing something.
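
To make the two sequences concrete, here is a minimal user-space sketch.
It assumes a toy cache that only counts dirty sectors; toy_disk, toy_write,
toy_flush and both barrier_* helpers are made-up names, not IDE code. It
shows only what each sequence leaves in the cache and how many flushes it
costs; it does not settle which one is sufficient.

```c
#include <assert.h>
#include <stddef.h>

/* Toy write-back cache: only counts dirty sectors and flush commands. */
struct toy_disk {
	size_t ndirty;   /* sectors sitting in the write cache */
	size_t flushes;  /* flush commands issued so far */
};

static void toy_write(struct toy_disk *d)
{
	assert(d != NULL);
	d->ndirty++;
}

static void toy_flush(struct toy_disk *d)
{
	assert(d != NULL);
	d->ndirty = 0;   /* whole cache goes to the platters */
	d->flushes++;
}

/*
 * pre flush + write + post flush.
 * Returns how many older sectors were still cached when the
 * barrier write was issued.
 */
static size_t barrier_pre_post(struct toy_disk *d)
{
	size_t older;

	toy_flush(d);
	older = d->ndirty;
	toy_write(d);
	toy_flush(d);
	return older;
}

/* write + flush, same return value as above. */
static size_t barrier_post_only(struct toy_disk *d)
{
	size_t older = d->ndirty;

	toy_write(d);
	toy_flush(d);
	return older;
}
```

With two older writes cached, barrier_pre_post() returns 0 and issues two
flushes, while barrier_post_only() returns 2 and issues one. In both cases
the cache is empty afterwards.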

> > > discussion dead lock, I suggest you do so by showing a patch changing
> > > eg the ide_get_error_location() stuff. Sadly you could have done this
> > > roughly 10 times in the same time frame that you have written these
> > > emails.
> >
> > Are you trying to trick me into doing your task?
>
> I don't know why you keep thinking this is my job to complete this
> project 100% on my own?! There's a general problem that needs solving,
> and I would hope that others would be willing to help out where needed.
> I would encourage people to help out if they care about the issue.

Please note that the barrier patches are a new feature, not a bugfix: you
can always disable the write cache, unless the firmware/disk is buggy, but in
that case you can't be sure that they don't lie about flushes too.

Yes, that sucks for performance, but you can instead get drives which
expire their caches (most do?) and a UPS (they are really cheap nowadays).

;-)

I like the idea of flush barriers but I see more and more problems
with doing it sanely.

> I'm not going to comment further on your mails in this thread, unless
> they have substantial technical comment. Your 'crap' arguments so far
> have been largely unsubstantiated, and as such they don't accomplish
> much except waste time.

OK. I tried to rewrite IDE part and discovered this:

+int ide_end_request (ide_drive_t *drive, int uptodate, int nr_sectors)
+{
+	struct request *rq;
+	unsigned long flags;
+	int ret = 1;
+
+	spin_lock_irqsave(&ide_lock, flags);
+	rq = HWGROUP(drive)->rq;
+
+	if (!nr_sectors)
+		nr_sectors = rq->hard_cur_sectors;
+
+	if (!blk_barrier_rq(rq))
+		ret = __ide_end_request(drive, rq, uptodate, nr_sectors);
+	else {

It seems that __ide_end_request() and thus end_that_request_first()
is called only once for the real_rq request - it breaks partial completions
which are done by the IDE PIO code (-> it breaks IDE PIO).

I don't see an easy way to fix it because if we do partial completions
we'll ACK some bios to higher layers before doing flush.

Fixing IDE not to do partial completions is also not easy
(I'm doing it slowly).

+		struct request *flush_rq = &HWGROUP(drive)->wrq;
+
+		flush_rq->nr_sectors -= nr_sectors;
+		if (!flush_rq->nr_sectors) {
+			ide_queue_flush_cmd(drive, rq, 1);
+			ret = 0;
+		}
+	}
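
A toy model of the problem as I read it, with made-up names (toy_rq,
end_chunk_normal, end_chunk_barrier) and none of the real locking or bio
handling: the normal path ACKs every chunk upward as PIO finishes it, the
barrier path in the fragment above only counts sectors down and ACKs
nothing until the flush is queued.

```c
#include <assert.h>
#include <stddef.h>

/* Toy request: sectors outstanding and what upper layers have seen. */
struct toy_rq {
	unsigned int nr_sectors;  /* sectors still outstanding */
	unsigned int acked;       /* sectors already ACK-ed upward */
	unsigned int ack_calls;   /* number of partial completions delivered */
};

/* Normal path: each finished chunk is ACK-ed right away.
 * Returns 1 while sectors remain, 0 when the request is done. */
static int end_chunk_normal(struct toy_rq *rq, unsigned int chunk)
{
	assert(rq != NULL && chunk <= rq->nr_sectors);
	rq->nr_sectors -= chunk;
	rq->acked += chunk;
	rq->ack_calls++;
	return rq->nr_sectors != 0;
}

/* Barrier path as modelled here: count down only, queue the post flush
 * once everything is transferred, ACK nothing in the meantime. */
static int end_chunk_barrier(struct toy_rq *rq, unsigned int chunk,
			     int *flush_queued)
{
	assert(rq != NULL && flush_queued != NULL);
	assert(chunk <= rq->nr_sectors);
	rq->nr_sectors -= chunk;
	if (!rq->nr_sectors) {
		*flush_queued = 1;
		return 0;
	}
	return 1;
}
```

For a 16-sector request finished in two 8-sector PIO chunks, the normal
path delivers two partial completions, the barrier path delivers none
before the flush.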

BTW are you aware of two (minor?) corner cases of the current implementation?

- you can't have journal on a separate device
(pre and post flushes will only flush device storing journal not data)

- if you have more than one filesystem on the disk (quite likely scenario) it
can happen that barrier (flush) will fail for sector for file from the
other fs and later barrier for this other fs will succeed

[ If you see any mistakes in my comments please correct them.
I tried to be as accurate as possible. ]

Thanks,
Bartlomiej

2004-06-09 22:04:23

by Andrew Morton

Subject: Re: ide errors in 7-rc1-mm1 and later

Bartlomiej Zolnierkiewicz wrote:
>
> Sure, you don't need my ACK, that's obvious - you need it from Linus/Andrew.

But your nack would almost certainly prevent a merge, pending resolution of
whatever the issues are.

>
> ...
>
> BTW are you aware of two (minor?) corner cases of the current implementation?
>
> - you can't have journal on a separate device
> (pre and post flushes will only flush device storing journal not data)

External journals in ext3 aren't really supported - they just happen to
work as a plaything. I haven't tested it in several years, but I believe
people do use it.

That being said, the bug you identify is an ext3 bug. The easiest way for
me to fix it up within ext3 would be to issue some flush command to the
filesystem's disk, wait for that to complete, then write the buffer_ordered
commit block to the journal's disk. That's blkdev_issue_flush(), yes?
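
A minimal sketch of that ordering, as a user-space toy rather than ext3
code. The struct and function names (jdev, jdev_flush, commit_external) are
invented for illustration; jdev_flush() merely stands in for issuing a
flush to the filesystem's disk and waiting for it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy block device for the external-journal case. */
struct jdev {
	int dirty;          /* sectors still in the write cache */
	int fail_flush;     /* set to make the next flush fail */
	int commit_blocks;  /* commit blocks written to this device */
};

/* Flush the whole cache and wait; 0 on success, -1 on failure. */
static int jdev_flush(struct jdev *d)
{
	assert(d != NULL);
	if (d->fail_flush)
		return -1;
	d->dirty = 0;
	return 0;
}

/* Flush the data disk first; only then write the commit block
 * to the journal disk. */
static int commit_external(struct jdev *fs_disk, struct jdev *journal_disk)
{
	int err = jdev_flush(fs_disk);

	if (err)
		return err;  /* data not stable: do not commit */
	assert(journal_disk != NULL);
	journal_disk->commit_blocks++;
	return 0;
}
```

If the flush of the filesystem's disk fails, no commit block reaches the
journal disk, so the commit can never get ahead of the data.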

> - if you have more than one filesystem on the disk (quite likely scenario) it
> can happen that barrier (flush) will fail for sector for file from the
> other fs and later barrier for this other fs will succeed

I don't understand this one.

Subject: Re: ide errors in 7-rc1-mm1 and later

On Thursday 10 of June 2004 00:06, Andrew Morton wrote:
> Bartlomiej Zolnierkiewicz wrote:
> > Sure, you don't need my ACK, that's obvious - you need it from
> > Linus/Andrew.
>
> But your nack would almost certainly prevent a merge, pending resolution of
> whatever the issues are.

Thanks.

> > ...
> >
> > BTW are you aware of two (minor?) corner cases of the current
> > implementation?
> >
> > - you can't have journal on a separate device
> > (pre and post flushes will only flush device storing journal not data)
>
> External journals in ext3 aren't really supported - they just happen to
> work as a plaything. I haven't tested it in several years, but I believe
> people do use it.
>
> That being said, the bug you identify is an ext3 bug. The easiest way for

The same is probably true for the current reiserfs barrier patch.

> me to fix it up within ext3 would be to issue some flush command to the
> filesystem's disk, wait for that to complete, then write the buffer_ordered
> commit block to the journal's disk. That's blkdev_issue_flush(), yes?

Yes and I think that it should work just the way you've just described.

Instead of pre+post flushes in ordered write:

flush from fs to sync disk + ordered commit (write+flush)

or even:

flush from fs to sync disk + commit

The latter is a bit less secure but we can also have 'unfortunate' power
failures for ordered write: while writing or between write and actual
flush (although 'race window' is smaller). Anyway ordering is preserved
(commit won't hit platters before the real data!) and it is a _lot_ simpler
(+ a bit faster). This solution is 'good enough' for me but the former
one is also okay (unlike 'pre/post flushes private to the IDE driver')
but requires solving 'we need partial completions for IDE' problem first.

Does the journal have a checksum or some other protection against failure during
writing the journal to a disk? If not, then it can still be screwed even with
ordered writes if we are unfortunate enough. ;-)

> > - if you have more than one filesystem on the disk (quite likely scenario) it
> > can happen that barrier (flush) will fail for sector for file from the
> > other fs and later barrier for this other fs will succeed
>
> I don't understand this one.

Flush command can fail for sector which came into disk's write cache
from some write request for some other fs on the same disk i.e.

write requests for fs 'a' (sector 'x' stays in write cache)
write requests for fs 'b'
commit log for fs 'b' -> barrier for fs 'b'
barrier fails because of sector 'x'
commit log for fs 'a' -> barrier for fs 'a'
barrier succeeds

Such a scenario is highly unlikely (disks do bad sector re-allocation
on write) but not impossible (the pool of sectors for remapping is not unlimited).
That's why I think it is a minor issue (but it is still worth knowing about).

This is just to make it clear that write barriers can help for power failures
and hard resets or disks which don't expire their caches (are there any?) but
can't help once things go wrong with a disk (and actually can make it worse).
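
The scenario above as a small user-space sketch. Everything in it is an
assumption made for illustration: the names (shared_cache, cache_write,
cache_flush), the fixed-size cache, and in particular the rule that a
failed sector is reported once and then dropped from the cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOTS 8

/* One cached sector, tagged with the filesystem that wrote it. */
struct cached {
	unsigned long sector;
	char fs;    /* 'a' or 'b' */
	bool bad;   /* true if the write-out of this sector will fail */
};

/* One write cache shared by every filesystem on the disk. */
struct shared_cache {
	struct cached e[SLOTS];
	size_t n;
};

static void cache_write(struct shared_cache *c, unsigned long sector,
			char fs, bool bad)
{
	assert(c != NULL && c->n < SLOTS);
	c->e[c->n++] = (struct cached){ sector, fs, bad };
}

/* Flush the whole cache. Returns 0 on success, -1 if a sector failed,
 * storing the owner of the first failing sector in *failed_fs.
 * The cache is emptied either way (assumption, see above). */
static int cache_flush(struct shared_cache *c, char *failed_fs)
{
	int ret = 0;

	assert(c != NULL && failed_fs != NULL);
	for (size_t i = 0; i < c->n; i++) {
		if (c->e[i].bad && ret == 0) {
			*failed_fs = c->e[i].fs;
			ret = -1;
		}
	}
	c->n = 0;
	return ret;
}
```

Replaying the sequence: fs 'a' writes bad sector 'x', fs 'b' writes and
commits, and the flush for 'b' fails on a sector owned by 'a'. The later
flush for 'a' then succeeds, because 'x' is no longer in the cache.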


2004-06-09 23:46:12

by Ed Tomlinson

Subject: Re: ide errors in 7-rc1-mm1 and later

Hi,

I am still seeing these with 7-rc3-mm1... No extra diag info either. It would be
really nice to see this one fixed.

TIA
Ed Tomlinson

On June 3, 2004 10:07 pm, Ed Tomlinson wrote:
> I am still getting these ide errors with 7-rc2-mm2. I get the errors even
> if I mount with barrier=0 (or just defaults). It would seem that something is
> sending my drive commands it does not understand...
>
> May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
>
> How can we find out what is wrong?
>
> This does not seem to be an error that corrupts the fs, it just slows things
> down when it hits a group of these. Note that they keep poping up - they
> do stop (I still get them hours after booting).
>
> TIA
> Ed Tomlinson
>
> ----------------------
> 7-mm4 ok
> 7-mm5 na
> 7-rc1-mm1 errors
> 7-rc2 ok
> 7-rc2-mm2 errors
>
> CONFIG_IDE=y
> CONFIG_BLK_DEV_IDE=y
>
> #
> # Please see Documentation/ide.txt for help/info on IDE drives
> #
> # CONFIG_BLK_DEV_HD_IDE is not set
> CONFIG_BLK_DEV_IDEDISK=y
> CONFIG_IDEDISK_MULTI_MODE=y
> # CONFIG_IDEDISK_STROKE is not set
> CONFIG_BLK_DEV_IDECD=m
> CONFIG_BLK_DEV_IDETAPE=m
> # CONFIG_BLK_DEV_IDEFLOPPY is not set
> CONFIG_BLK_DEV_IDESCSI=m
> # CONFIG_IDE_TASK_IOCTL is not set
> CONFIG_IDE_TASKFILE_IO=y
>
> #
> # IDE chipset support/bugfixes
> #
> CONFIG_IDE_GENERIC=y
> # CONFIG_BLK_DEV_CMD640 is not set
> CONFIG_BLK_DEV_IDEPNP=y
> CONFIG_BLK_DEV_IDEPCI=y
> CONFIG_IDEPCI_SHARE_IRQ=y
> # CONFIG_BLK_DEV_OFFBOARD is not set
> # CONFIG_BLK_DEV_GENERIC is not set
> # CONFIG_BLK_DEV_OPTI621 is not set
> # CONFIG_BLK_DEV_RZ1000 is not set
> CONFIG_BLK_DEV_IDEDMA_PCI=y
> # CONFIG_BLK_DEV_IDEDMA_FORCED is not set
> CONFIG_IDEDMA_PCI_AUTO=y
> # CONFIG_IDEDMA_ONLYDISK is not set
> CONFIG_BLK_DEV_ADMA=y
> # CONFIG_BLK_DEV_AEC62XX is not set
> # CONFIG_BLK_DEV_ALI15X3 is not set
> # CONFIG_BLK_DEV_AMD74XX is not set
> # CONFIG_BLK_DEV_ATIIXP is not set
> # CONFIG_BLK_DEV_CMD64X is not set
> # CONFIG_BLK_DEV_TRIFLEX is not set
> # CONFIG_BLK_DEV_CY82C693 is not set
> # CONFIG_BLK_DEV_CS5520 is not set
> # CONFIG_BLK_DEV_CS5530 is not set
> # CONFIG_BLK_DEV_HPT34X is not set
> # CONFIG_BLK_DEV_HPT366 is not set
> # CONFIG_BLK_DEV_SC1200 is not set
> CONFIG_BLK_DEV_PIIX=y
> # CONFIG_BLK_DEV_NS87415 is not set
> # CONFIG_BLK_DEV_PDC202XX_OLD is not set
> # CONFIG_BLK_DEV_PDC202XX_NEW is not set
> # CONFIG_BLK_DEV_SVWKS is not set
> # CONFIG_BLK_DEV_SIIMAGE is not set
> # CONFIG_BLK_DEV_SIS5513 is not set
> # CONFIG_BLK_DEV_SLC90E66 is not set
> # CONFIG_BLK_DEV_TRM290 is not set
> # CONFIG_BLK_DEV_VIA82CXXX is not set
> # CONFIG_IDE_ARM is not set
> # CONFIG_IDE_CHIPSETS is not set
> CONFIG_BLK_DEV_IDEDMA=y
> # CONFIG_IDEDMA_IVB is not set
> CONFIG_IDEDMA_AUTO=y
>
> > Think this is not just a barrier problem (unless barrier is the default).
> > One of my two drives gets the error below during operation.
> > The drive is the root drive and is mounted with defaults. 2.6.6-mm4
> > was the last kernel booted on this box. The 2.6.7-rc1-mm1 was compiled
> > with 2.95 with the following fs options:
> >
> > CONFIG_EXT2_FS=y
> > # CONFIG_EXT2_FS_XATTR is not set
> > CONFIG_EXT3_FS=m
> > # CONFIG_EXT3_FS_XATTR is not set
> > CONFIG_JBD=m
> > # CONFIG_JBD_DEBUG is not set
> > CONFIG_REISERFS_FS=y
> > # CONFIG_REISERFS_CHECK is not set
> > # CONFIG_REISERFS_PROC_INFO is not set
> > # CONFIG_REISERFS_FS_XATTR is not set
> > # CONFIG_JFS_FS is not set
> > # CONFIG_XFS_FS is not set
> > # CONFIG_MINIX_FS is not set
> > # CONFIG_ROMFS_FS is not set
> > # CONFIG_QUOTA is not set
> > # CONFIG_AUTOFS_FS is not set
> > CONFIG_AUTOFS4_FS=m
>
> Disk /dev/hda: 6448 MB, 6448619520 bytes
> 240 heads, 63 sectors/track, 833 cylinders
> Units = cylinders of 15120 * 512 = 7741440 bytes
>
> Device Boot Start End Blocks Id System
> /dev/hda1 1 99 748408+ 82 Linux swap
> /dev/hda2 100 108 68040 83 Linux
> /dev/hda3 * 109 833 5481000 83 Linux
>
> > hda reports:
> > root@bert:/usr/src/linux# hdparm -iI /dev/hda
> >
> > /dev/hda:
> >
> > Model=WDC AC26400R, FwRev=15.01J15, SerialNo=WD-WM6271600165
> > Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
> > RawCHS=13328/15/63, TrkSize=57600, SectSize=600, ECCbytes=40
> > BuffType=DualPortCache, BuffSize=512kB, MaxMultSect=16, MultSect=16
> > CurCHS=13328/15/63, CurSects=12594960, LBA=yes, LBAsects=12594960
> > IORDY=on/off, tPIO={min:160,w/IORDY:120}, tDMA={min:120,rec:120}
> > PIO modes: pio0 pio1 pio2 pio3 pio4
> > DMA modes: mdma0 mdma1 mdma2
> > UDMA modes: udma0 udma1 *udma2 udma3 udma4
> > AdvancedPM=no WriteCache=enabled
> > Drive conforms to: device does not report version: 1 2 3 4
> >
> > * signifies the current active mode
> >
> >
> > ATA device, with non-removable media
> > Model Number: WDC AC26400R
> > Serial Number: WD-WM6271600165
> > Firmware Revision: 15.01J15
> > Standards:
> > Supported: 4 3 2 1
> > Likely used: 4
> > Configuration:
> > Logical max current
> > cylinders 13328 13328
> > heads 15 15
> > sectors/track 63 63
> > --
> > bytes/track: 57600 bytes/sector: 600
> > CHS current addressable sectors: 12594960
> > LBA user addressable sectors: 12594960
> > device size with M = 1024*1024: 6149 MBytes
> > device size with M = 1000*1000: 6448 MBytes (6 GB)
> > Capabilities:
> > LBA, IORDY(can be disabled)
> > Buffer size: 512.0kB bytes avail on r/w long: 40 Queue depth: 1
> > Standby timer values: spec'd by Standard, no device specific minimum
> > R/W multiple sector transfer: Max = 16 Current = 16
> > DMA: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4
> > Cycle time: min=120ns recommended=120ns
> > PIO: pio0 pio1 pio2 pio3 pio4
> > Cycle time: no flow control=160ns IORDY flow control=120ns
> > Commands/features:
> > Enabled Supported:
> > * READ BUFFER cmd
> > * WRITE BUFFER cmd
> > * Look-ahead
> > * Write cache
> > * Power Management feature set
> > * SMART feature set
> >
> > root@bert:/usr/src/linux# hdparm -iI /dev/hdb
> >
> > /dev/hdb:
> >
> > Model=Maxtor 6E030L0, FwRev=NAR61590, SerialNo=E178CV5E
> > Config={ Fixed }
> > RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=57
> > BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
> > CurCHS=17475/15/63, CurSects=16513875, LBA=yes, LBAsects=60058656
> > IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
> > PIO modes: pio0 pio1 pio2 pio3 pio4
> > DMA modes: mdma0 mdma1 mdma2
> > UDMA modes: udma0 udma1 *udma2 udma3 udma4 udma5 udma6
> > AdvancedPM=yes: disabled (255) WriteCache=enabled
> > Drive conforms to: (null):
> >
> > * signifies the current active mode
> >
> >
> > ATA device, with non-removable media
> > Model Number: Maxtor 6E030L0
> > Serial Number: E178CV5E
> > Firmware Revision: NAR61590
> > Standards:
> > Supported: 7 6 5 4
> > Likely used: 7
> > Configuration:
> > Logical max current
> > cylinders 16383 17475
> > heads 16 15
> > sectors/track 63 63
> > --
> > CHS current addressable sectors: 16513875
> > LBA user addressable sectors: 60058656
> > device size with M = 1024*1024: 29325 MBytes
> > device size with M = 1000*1000: 30750 MBytes (30 GB)
> > Capabilities:
> > LBA, IORDY(can be disabled)
> > Queue depth: 1
> > Standby timer values: spec'd by Standard, no device specific minimum
> > R/W multiple sector transfer: Max = 16 Current = 16
> > Advanced power management level: unknown setting (0x0000)
> > Recommended acoustic management value: 192, current value: 254
> > DMA: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 udma5 udma6
> > Cycle time: min=120ns recommended=120ns
> > PIO: pio0 pio1 pio2 pio3 pio4
> > Cycle time: no flow control=120ns IORDY flow control=120ns
> > Commands/features:
> > Enabled Supported:
> > * NOP cmd
> > * READ BUFFER cmd
> > * WRITE BUFFER cmd
> > * Host Protected Area feature set
> > * Look-ahead
> > * Write cache
> > * Power Management feature set
> > Security Mode feature set
> > * SMART feature set
> > * FLUSH CACHE EXT command
> > * Mandatory FLUSH CACHE command
> > * Device Configuration Overlay feature set
> > * Automatic Acoustic Management feature set
> > SET MAX security extension
> > Advanced Power Management feature set
> > * DOWNLOAD MICROCODE cmd
> > * SMART self-test
> > * SMART error logging
> > Security:
> > Master password revision code = 65534
> > supported
> > not enabled
> > not locked
> > not frozen
> > not expired: security count
> > not supported: enhanced erase
> > HW reset results:
> > CBLID- above Vih
> > Device num = 1 determined by CSEL
> > Checksum: correct
> >
> > hdb is accessed via dm and evms. This is what the boot reports:
> >
> > May 27 18:17:39 bert kernel: Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> > May 27 18:17:39 bert kernel: ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> > May 27 18:17:39 bert kernel: PIIX4: IDE controller at PCI slot 0000:00:14.1
> > May 27 18:17:39 bert kernel: PIIX4: chipset revision 1
> > May 27 18:17:39 bert kernel: PIIX4: not 100%% native mode: will probe irqs later
> > May 27 18:17:39 bert kernel: ide0: BM-DMA at 0x10c0-0x10c7, BIOS settings: hda:pio, hdb:DMA
> > May 27 18:17:39 bert kernel: ide1: BM-DMA at 0x10c8-0x10cf, BIOS settings: hdc:DMA, hdd:pio
> > May 27 18:17:39 bert kernel: hda: WDC AC26400R, ATA DISK drive
> > May 27 18:17:39 bert kernel: hdb: Maxtor 6E030L0, ATA DISK drive
> > May 27 18:17:39 bert kernel: Using anticipatory io scheduler
> > May 27 18:17:39 bert kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> > May 27 18:17:39 bert kernel: hdc: HL-DT-ST RW/DVD GCC-4480B, ATAPI CD/DVD-ROM drive
> > May 27 18:17:39 bert kernel: ide1 at 0x170-0x177,0x376 on irq 15
> > May 27 18:17:39 bert kernel: pnp: the driver 'ide' has been registered
> > May 27 18:17:39 bert kernel: hda: max request size: 128KiB
> > May 27 18:17:39 bert kernel: hda: 12594960 sectors (6448 MB) w/512KiB Cache, CHS=13328/15/63, UDMA(33)
> > May 27 18:17:39 bert kernel: hda: cache flushes supported
> > May 27 18:17:39 bert kernel: hda: hda1 hda2 hda3
> > May 27 18:17:39 bert kernel: hdb: max request size: 128KiB
> > May 27 18:17:39 bert kernel: hdb: 60058656 sectors (30750 MB) w/2048KiB Cache, CHS=59582/16/63, UDMA(33)
> > May 27 18:17:39 bert kernel: hdb: cache flushes supported
> > May 27 18:17:39 bert kernel: hdb: hdb1 hdb2 hdb3 hdb4 < hdb5 >
> >
> > followed later by:
> >
> > May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > May 27 18:18:06 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > May 27 18:18:06 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > May 27 18:19:21 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > May 27 18:19:21 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > May 27 18:19:22 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > May 27 18:19:22 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > May 27 18:20:01 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > May 27 18:20:01 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > May 27 18:20:01 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > May 27 18:20:01 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > May 27 18:21:27 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > May 27 18:21:27 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> >
> >
> >
> >
> > Hope this helps,
> >
> > Ed
> >
> > On May 27, 2004 04:24 pm, Günther Persoons wrote:
> > > Hey,
> > > When I mount my reiser partition with the option barrier=flush I get
> > > the following message and error:
> > > My harddrive is a 2.5 inch Fujitsu 20GB IDE.
> > >
> > > mount /dev/hda10 /tmp -o barrier=flush
> > > mount: wrong fs type, bad option, bad superblock on /dev/hda10,
> > > or too many mounted file systems
> > > Log:
> > > ReiserFS: hda10: found reiserfs format "3.6" with standard journal
> > > ReiserFS: hda10: using ordered data mode
> > > reiserfs: using flush barriers
> > > ReiserFS: hda10: journal params: device hda10, size 8192, journal first
> > > block 18, max trans len 1024, max batch 900, max commit age 30, max
> > > trans age 30
> > > ReiserFS: hda10: checking transaction log (hda10)
> > > hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > hda: drive_cmd: error=0x04 { DriveStatusError }
> > > hda: barrier support doesn't work
> > > ReiserFS: hda10: warning: journal-837: IO error during journal replay
> > > ReiserFS: hda10: warning: Replay Failure, unable to mount
> > > ReiserFS: hda10: warning: sh-2022: reiserfs_fill_super: unable to
> > > initialize journal space
> >
>
>
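
The status and error bytes in the drive_cmd lines above are bit masks, so they can be decoded by hand. The following is a minimal sketch in C, not the kernel's own code: the bit values are the ATA status register bits, the names DriveReady, SeekComplete and Error are taken from the log, and the remaining names as well as the helpers (decode_bits, decode_ata_status) are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct bit_name {
    uint8_t mask;
    const char *name;
};

/* ATA status register bits, highest bit first. Only DriveReady,
 * SeekComplete and Error appear in the log; the other names are
 * assumed for this sketch. */
static const struct bit_name status_bits[] = {
    { 0x80, "Busy" },
    { 0x40, "DriveReady" },
    { 0x20, "DeviceFault" },
    { 0x10, "SeekComplete" },
    { 0x08, "DataRequest" },
    { 0x04, "CorrectedError" },
    { 0x02, "Index" },
    { 0x01, "Error" },
};

/* Write the names of all set bits into buf, separated by spaces. */
static char *decode_bits(uint8_t value, const struct bit_name *table,
                         size_t n, char *buf, size_t buflen)
{
    size_t used = 0;

    if (buflen == 0)
        return buf;
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        int w;

        if (!(value & table[i].mask))
            continue;
        w = snprintf(buf + used, buflen - used, "%s%s",
                     used ? " " : "", table[i].name);
        if (w < 0 || (size_t)w >= buflen - used)
            break; /* output truncated, stop here */
        used += (size_t)w;
    }
    return buf;
}

/* Hypothetical helper: decode one ATA status byte into text. */
char *decode_ata_status(uint8_t status, char *buf, size_t buflen)
{
    return decode_bits(status, status_bits,
                       sizeof(status_bits) / sizeof(status_bits[0]),
                       buf, buflen);
}
```

For status=0x51 (0x40 | 0x10 | 0x01) this gives "DriveReady SeekComplete Error", matching the log. The accompanying error=0x04 is the ABRT (command aborted) bit of the ATA error register, which fits a drive rejecting a command it does not implement.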

2004-06-09 23:51:36

by Andrew Morton

Subject: Re: ide errors in 7-rc1-mm1 and later

Ed Tomlinson <[email protected]> wrote:
>
> Hi,
>
> I am still seeing these with 7-rc3-mm1... No extra diag info either. It would be
> really nice to see this one fixed.

So ide-print-failed-opcode.patch isn't working. Presumably
HWGROUP(drive)->rq is null.

2004-06-09 23:50:46

by Andrew Morton

Subject: Re: ide errors in 7-rc1-mm1 and later

Bartlomiej Zolnierkiewicz <[email protected]> wrote:
>
> Does journal has checksum or some other protection against failure during
> writing journal to a disk? If not than it still can be screwed even with
> ordered writes if we are unfortunate enough. ;-)

A transaction is written to disk as two synchronous operations: write all
the data, wait on it, write the single commit block, wait on that.

If the commit block were to hit disk before the data then we have a window
in which poweroff+recovery would replay garbage into the filesystem.

So I think we have a bug in the current ext3 barrier implementation - we
need a blk_issue_flush() before submitting the buffer_ordered commit block.


> > > - if you have more than 1 filesystem on the disk (quite likely scenario) it
> > > can happen that barrier (flush) will fail for sector for file from the
> > > other fs and later barrier for this other fs will succeed
> >
> > I don't understand this one.
>
> Flush command can fail for sector which came into disk's write cache
> from some write request for some other fs on the same disk i.e.

Oh, you're referring to actual I/O errors? I tend to think all bets are
off if that happens - make sure we report it to the application, avoid
crashing the kernel and hope that fsck can get the data back...

Subject: Re: ide errors in 7-rc1-mm1 and later

On Thursday 10 of June 2004 01:50, Andrew Morton wrote:
> Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> > Does journal has checksum or some other protection against failure during
> > writing journal to a disk? If not than it still can be screwed even with
> > ordered writes if we are unfortunate enough. ;-)
>
> A transaction is written to disk as two synchronous operations: write all
> the data, wait on it, write the single commit block, wait on that.

That is how it looks from fs side, from disk side it may look like this:

write some data sectors (rest stays in cache)
write rest of data sectors (from cache)
write some commit sectors (rest stays in cache)
write rest of commit sectors (from cache)

fs atomic operations != disk atomic operations

> If the commit block were to hit disk before the data then we have a window
> in which poweroff+recovery would replay garbage into the filesystem.

Yes.

The quoted part of my mail is about situation when poweroff happens between
'write some commit sectors' and 'write rest of commit sectors (from cache)'
or during transferring commit sectors to a disk.

In both situations we end up with corrupted journal.

> So I think we have a bug in the current ext3 barrier implementation - we
> need a blk_issue_flush() before submitting the buffer_ordered commit block.

Sure. What's your opinion about doing blk_issue_flush() and ordinary commit
(pros+cons given in my previous mail)?

> > > > - if you have more than 1 filesystem on the disk (quite likely scenario)
> > > > it can happen that barrier (flush) will fail for sector for file from
> > > > the other fs and later barrier for this other fs will succeed
> > >
> > > I don't understand this one.
> >
> > Flush command can fail for sector which came into disk's write cache
> > from some write request for some other fs on the same disk i.e.
>
> Oh, you're referring to actual I/O errors? I tend to think all bets are
> off if that happens - make sure we report it to the application, avoid
> crashing the kernel and hope that fsck can get the data back...

Yes but things get more complicated with write caching on a disk side,
i.e. it is too late to report to application when you discover I/O error.

2004-06-10 00:18:13

by Ed Tomlinson

Subject: Re: ide errors in 7-rc1-mm1 and later

On June 9, 2004 07:52 pm, Andrew Morton wrote:
> Ed Tomlinson <[email protected]> wrote:
> >
> > Hi,
> >
> > I am still seeing these with 7-rc3-mm1... No extra diag info either. It would be
> > really nice to see this one fixed.
>
> So ide-print-failed-opcode.patch isn't working. Presumably
> HWGROUP(drive)->rq is null.

No change from the first time I tried the patch.

Ed

2004-06-10 00:27:39

by Chris Mason

Subject: Re: ide errors in 7-rc1-mm1 and later

On Wed, 2004-06-09 at 19:50, Andrew Morton wrote:
> Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> >
> > Does journal has checksum or some other protection against failure during
> > writing journal to a disk? If not than it still can be screwed even with
> > ordered writes if we are unfortunate enough. ;-)
>
> A transaction is written to disk as two synchronous operations: write all
> the data, wait on it, write the single commit block, wait on that.
>
> If the commit block were to hit disk before the data then we have a window
> in which poweroff+recovery would replay garbage into the filesystem.
>
> So I think we have a bug in the current ext3 barrier implementation - we
> need a blk_issue_flush() before submitting the buffer_ordered commit block.

The IDE barriers are both a pre and post flush. If the commit block is
ordered, before the commit block hits the disk we know all the blocks
previously submitted are also on disk.

-chris


Subject: Re: ide errors in 7-rc1-mm1 and later


/me just thinks loudly

'linear range' FLUSH CACHE seems so easy to implement that I always wondered
why FLUSH CACHE command doesn't make any use of LBA address and number
of sectors.

On Sunday 06 of June 2004 18:18, Eric D. Mudama wrote:
> On Sat, Jun 5 at 11:24, Jens Axboe wrote:
> >I did suggest this a few years ago. The comment I received was that
> >they didn't take suggestions from OS people, if I didn't have a drive
> >implementation to go with the proposal they couldn't use it for
> >anything. Which was interesting, since that seemed to suggest that t13
> >had little steering in ata development, they mainly put into the ATA
> >specs what drive manufacturers shoved at them. Of course this isn't 100%
> >true, but it does explain a lot of things :-)
>
> If it helps, I'm listening.
>
> Suggestions/proposals for new features etc, if they're a good idea, I
> can help push inside via our SATA/T13 reps. Note that as per all
> long-lived specs with multiple revisions, changing the behavior of an
> existing feature to something incompatible is virtually never
> feasible.
>
> >Andre even tried getting FUA to do what we needed, no such luck there.
> >Some other bigger OS wanted it differently, the rest is history.
>
> Lo siento, I wasn't around when that occurred. Of course, that other
> bigger OS has a very large installed base, and selling a drive that
> breaks it is corporate suicide.

2004-06-10 00:35:24

by Andrew Morton

Subject: Re: ide errors in 7-rc1-mm1 and later

Bartlomiej Zolnierkiewicz <[email protected]> wrote:
>
> On Thursday 10 of June 2004 01:50, Andrew Morton wrote:
> > Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> > > Does journal has checksum or some other protection against failure during
> > > writing journal to a disk? If not than it still can be screwed even with
> > > ordered writes if we are unfortunate enough. ;-)
> >
> > A transaction is written to disk as two synchronous operations: write all
> > the data, wait on it, write the single commit block, wait on that.
>
> That is how it looks from fs side, from disk side it may look like this:
>
> write some data sectors (rest stays in cache)
> write rest of data sectors (from cache)
> write some commit sectors (rest stays in cache)
> write rest of commit sectors (from cache)
>
> fs atomic operations != disk atomic operations

JBD is careful about that. There is a single commit block (1, 2 or 4k) and
the first eight bytes of that block contain a magic number and a sequence
number. If they're not both valid then replay considers the entire
transaction (data blocks + commit block) to be invalid.

So all we care about is the atomicity of the first eight bytes of a single
512-byte sector. I see no problem with internal-to-commit-block write
reordering.

The problem is that the commit block may hit disk prior to the preceding
data blocks, which is why we need a full flush prior to submitting the
commit block.
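
The replay-time check described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it follows the simplified layout given in this mail (a big-endian magic number in bytes 0-3, a big-endian sequence number in bytes 4-7), which may not match the real on-disk JBD header, and SKETCH_MAGIC, load_be32() and commit_block_valid() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative magic value; treat it as an assumption of this sketch. */
#define SKETCH_MAGIC 0xC03B3998u

/* Read a 32-bit big-endian value from a byte buffer. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Return 1 if the commit block carries both the expected magic number
 * and the expected sequence number, 0 otherwise. A 0 result means the
 * whole transaction (data blocks + commit block) is treated as invalid.
 */
int commit_block_valid(const unsigned char *block, size_t len,
                       uint32_t expected_seq)
{
    if (len < 8)
        return 0;
    if (load_be32(block) != SKETCH_MAGIC)
        return 0;
    return load_be32(block + 4) == expected_seq;
}
```

Only the first eight bytes are inspected, so a torn write further into the commit block does not change the outcome of the check.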

> > If the commit block were to hit disk before the data then we have a window
> > in which poweroff+recovery would replay garbage into the filesystem.
>
> Yes.
>
> The quoted part of my mail is about situation when poweroff happens between
> 'write some commit sectors' and 'write rest of commit sectors (from cache)'
> or during transferring commit sectors to a disk.

There is just a single commit sector.

> Sure. What's your opinion about doing blk_issue_flush() and ordinary commit
> (pros+cons given in my previous mail)?

I think we need:

submit_data_sectors();
blkdev_issue_flush();
wait_on_data_sectors();

/*
* All of the transaction's data sectors are now on disk. Submit the
* commit sector
*/
mark_buffer_ordered(commit_bh);
submit_bh(commit_bh);
wait_on_buffer(commit_bh);

Or something like that. Haven't really looked at the blkdev_issue_flush()
design yet. It has this mysterious comment: "Caller must run
wait_for_completion() on its own.". Wait for what completion??
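
To make the ordering hazard concrete, here is a toy model of a drive with a volatile write cache. It is a sketch under simplifying assumptions, not kernel code: all names (toy_disk, toy_write, toy_destage, toy_flush, toy_power_loss, toy_replay_safe) are invented for illustration, and the drive is assumed to be free to write out any cached sector at any time.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_SECTORS 16

/* Toy drive: a write first lands in a volatile cache, later on the media. */
struct toy_disk {
    bool cached[TOY_SECTORS];   /* acknowledged to the host, still volatile */
    bool on_media[TOY_SECTORS]; /* survives power loss */
};

/* Host write: acknowledged as soon as it is in the cache. */
void toy_write(struct toy_disk *d, size_t sector)
{
    if (sector < TOY_SECTORS)
        d->cached[sector] = true;
}

/* The drive may move any single cached sector to the media at any time. */
void toy_destage(struct toy_disk *d, size_t sector)
{
    if (sector < TOY_SECTORS && d->cached[sector]) {
        d->cached[sector] = false;
        d->on_media[sector] = true;
    }
}

/* Cache flush: everything cached so far reaches the media. */
void toy_flush(struct toy_disk *d)
{
    for (size_t s = 0; s < TOY_SECTORS; s++)
        toy_destage(d, s);
}

/* Power loss: whatever is still only in the cache is gone. */
void toy_power_loss(struct toy_disk *d)
{
    for (size_t s = 0; s < TOY_SECTORS; s++)
        d->cached[s] = false;
}

/*
 * Replay is safe if the commit sector is absent from the media (the
 * transaction is skipped) or if every data sector is on the media too.
 */
bool toy_replay_safe(const struct toy_disk *d, size_t commit,
                     const size_t *data, size_t ndata)
{
    if (commit >= TOY_SECTORS || !d->on_media[commit])
        return true;
    for (size_t i = 0; i < ndata; i++)
        if (data[i] >= TOY_SECTORS || !d->on_media[data[i]])
            return false;
    return true;
}
```

In this model, writing data sectors 1 and 2 and commit sector 3 with no flush, then letting the drive destage sector 3 first and losing power, leaves a commit on the media without its data. Calling toy_flush() before the commit write rules that case out, which is the point of the flush in the sequence above.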

2004-06-10 00:36:28

by Andrew Morton

Subject: Re: ide errors in 7-rc1-mm1 and later

Chris Mason <[email protected]> wrote:
>
> On Wed, 2004-06-09 at 19:50, Andrew Morton wrote:
> > Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> > >
> > > Does journal has checksum or some other protection against failure during
> > > writing journal to a disk? If not than it still can be screwed even with
> > > ordered writes if we are unfortunate enough. ;-)
> >
> > A transaction is written to disk as two synchronous operations: write all
> > the data, wait on it, write the single commit block, wait on that.
> >
> > If the commit block were to hit disk before the data then we have a window
> > in which poweroff+recovery would replay garbage into the filesystem.
> >
> > So I think we have a bug in the current ext3 barrier implementation - we
> > need a blk_issue_flush() before submitting the buffer_ordered commit block.
>
> The IDE barriers are both a pre and post flush. If the commit block is
> ordered, before the commit block hits the disk we know all the blocks
> previously submitted are also on disk.
>

Oh, OK. Will the same apply to (for example) scsi?

Subject: Re: ide errors in 7-rc1-mm1 and later

On Thursday 10 of June 2004 02:38, Andrew Morton wrote:
> Chris Mason <[email protected]> wrote:
> > On Wed, 2004-06-09 at 19:50, Andrew Morton wrote:
> > > Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> > > > Does journal has checksum or some other protection against failure
> > > > during writing journal to a disk? If not than it still can be
> > > > screwed even with ordered writes if we are unfortunate enough. ;-)
> > >
> > > A transaction is written to disk as two synchronous operations: write
> > > all the data, wait on it, write the single commit block, wait on that.
> > >
> > > If the commit block were to hit disk before the data then we have a
> > > window in which poweroff+recovery would replay garbage into the
> > > filesystem.
> > >
> > > So I think we have a bug in the current ext3 barrier implementation -
> > > we need a blk_issue_flush() before submitting the buffer_ordered commit
> > > block.
> >
> > The IDE barriers are both a pre and post flush. If the commit block is
> > ordered, before the commit block hits the disk we know all the blocks
> > previously submitted are also on disk.
>
> Oh, OK. Will the same apply to (for example) scsi?

Not OK. Chris, pre and post flushes are for the same device.
Journal may be on different device than filesystem!


Subject: Re: ide errors in 7-rc1-mm1 and later

On Thursday 10 of June 2004 02:37, Andrew Morton wrote:
> Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> > On Thursday 10 of June 2004 01:50, Andrew Morton wrote:
> > > Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> > > > Does journal has checksum or some other protection against failure
> > > > during writing journal to a disk? If not than it still can be
> > > > screwed even with ordered writes if we are unfortunate enough. ;-)
> > >
> > > A transaction is written to disk as two synchronous operations: write
> > > all the data, wait on it, write the single commit block, wait on that.
> >
> > That is how it looks from fs side, from disk side it may look like this:
> >
> > write some data sectors (rest stays in cache)
> > write rest of data sectors (from cache)
> > write some commit sectors (rest stays in cache)
> > write rest of commit sectors (from cache)
> >
> > fs atomic operations != disk atomic operations
>
> JBD is careful about that. There is a single commit block (1, 2 or 4k) and
> the first eight bytes of that block contain a magic number and a sequence
> number. If they're not both valid then replay considers the entire
> transaction (data blocks + commit block) to be invalid.
>
> So all we care about is the atomicity of the first eight bytes of a single
> 512-byte sector. I see no problem with internal-to-commit-block write
> reordering.

OK, thanks for explaining this.

> The problem is that the commit block may hit disk prior to the preceding
> data blocks, which is why we need a full flush prior to submitting the
> commit block.

Yes, yes, this is really obvious for me.
I was also worried about write cache vs commit block write.

> > > If the commit block were to hit disk before the data then we have a
> > > window in which poweroff+recovery would replay garbage into the
> > > filesystem.
> >
> > Yes.
> >
> > The quoted part of my mail is about situation when poweroff happens
> > between 'write some commit sectors' and 'write rest of commit sectors
> > (from cache)' or during transferring commit sectors to a disk.
>
> There is just a single commit sector.

Only one 512-byte sector? Good!

> > Sure. What's your opinion about doing blk_issue_flush() and ordinary
> > commit (pros+cons given in my previous mail)?
>
> I think we need:
>
> submit_data_sectors();
> blkdev_issue_flush();
> wait_on_data_sectors();
>
> /*
> * All of the transaction's data sectors are now on disk. Submit the
> * commit sector
> */
> mark_buffer_ordered(commit_bh);

Ordered write is not really needed because the next
'data cycle' will provide us with needed ordering.

submit_data_sectors();
blkdev_issue_flush();

^^^
flushes previous commit before the new one is submitted

wait_on_data_sectors();

> submit_bh(commit_bh);
> wait_on_buffer(commit_bh);
>
> Or something like that. Haven't really looked at the blkdev_issue_flush()
> design yet. It has this mysterious comment: "Caller must run
> wait_for_completion() on its own.". Wait for what completion??

Subject: Re: ide errors in 7-rc1-mm1 and later

On Thursday 10 of June 2004 02:28, Chris Mason wrote:
> On Wed, 2004-06-09 at 19:50, Andrew Morton wrote:
> > Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> > > Does journal has checksum or some other protection against failure
> > > during writing journal to a disk? If not than it still can be screwed
> > > even with ordered writes if we are unfortunate enough. ;-)
> >
> > A transaction is written to disk as two synchronous operations: write all
> > the data, wait on it, write the single commit block, wait on that.
> >
> > If the commit block were to hit disk before the data then we have a
> > window in which poweroff+recovery would replay garbage into the
> > filesystem.
> >
> > So I think we have a bug in the current ext3 barrier implementation - we
> > need a blk_issue_flush() before submitting the buffer_ordered commit
> > block.
>
> The IDE barriers are both a pre and post flush. If the commit block is
> ordered, before the commit block hits the disk we know all the blocks
> previously submitted are also on disk.

Please re-read my mail. Journal may be on a different disk than the filesystem.
IDE barriers do pre and post flush but for the same device.

> -chris

2004-06-10 06:20:57

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Thu, Jun 10 2004, Bartlomiej Zolnierkiewicz wrote:
>
> /me just thinks loudly
>
> 'linear range' FLUSH CACHE seems so easy to implement that I always wondered
> why FLUSH CACHE command doesn't make any use of LBA address and number
> of sectors.

Indeed, that would be very helpful as well.

--
Jens Axboe

2004-06-10 06:26:29

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Wed, Jun 09 2004, Bartlomiej Zolnierkiewicz wrote:
>
> [ end of flaming + new technical arguments, please read ]

Super :-)

> On Saturday 05 of June 2004 11:18, Jens Axboe wrote:
> > On Fri, Jun 04 2004, Bartlomiej Zolnierkiewicz wrote:
> > > > > Yep, you prefer to increase my work load instead.
> > > >
> > > > If you think that any change to the ide base is increasing your work
> > > > load, then yes. Otherwise no.
> > >
> > > No, only the messy ones.
> > >
> > > > > > That you need to queue pre/post flushes to support barriers is a
> > > > > > _driver implementation detail_ in my opinion. You don't want to
> > > > > > even advertise
> > > > >
> > > > > It is implementation braindamage IMO (but I'll buy it if rest is OK).
> > > >
> > > > Well feel free to pull a rabbit out of your hat and suggest something
> > > > else that works for barriers. It's mind boggling that nothing so far
> > > > has come out of t13 to address this, I guess data integrity isn't high
> > > > on their list.
> > > >
> > > > So in short, either shut up or put up.
> > >
> > > Yeah, this is the hardest part. I'll see what can be done.
> > >
> > > > > > that to upper layers. I will move a little of that into the block
> > > > > > layer, if only because SATA will need it as well.
> > > > > >
> > > > > > As for REQ_DRIVE_TASK and ide_get_error_location(), well hell I do
> > > > > > take patches! If there's something you consider broken, damnit send
> > > > > > a patch
> > > > >
> > > > > It is _your_ job to do it properly.
> > > >
> > > > I _am_ doing it properly. If you think otherwise, then I suggest you
> > > > show in code what you want changed. If you think it's my job to keep
> > > > changing the code based on unclear suggestions, then you are sadly
> > > > mistaken.
> > >
> > > Suggestions were clear, you've chosen to ignore them wishing that
> > > I will correct the patch or that you will push patch upstream anyway.
> >
> > And you seem to think that an IDE maintainers listing provides you with
> > a magical wand that says what goes and doesn't. You might want to check
> > if that hat fits too tightly. Generally, I'd like folks to help out.
>
> Sure, you don't need my ACK, that's obvious - you need it from Linus/Andrew.

I didn't mean for merge, for 2.4 I quite happily carried it in the SUSE
tree. Of course the goal is getting it merged eventually, but time
restrictions the past months have just not made a whole lot of time
available to get it in such a shape.

> > > > > There are no double standards, 'IDE crap embargo' holds for everyone.
> > > >
> > > > Like it or not, but ide code needs changing to support barriers one way
> > >
> > > Rule is simple "no more crappola in IDE" and I don't care what your
> > > patch does if this rule is violated.
> >
> > I'm really sick of having this debate, it's a complete waste of time.
> > I'm not looking for your approval or anything in that order, and since
>
> I hope that people doing block layer changes won't get the same attitude.

Well I really hope that I wouldn't put them in this position like you
did, so yeah I agree.

> > we don't agree on all the points in solving this problem, there's no point
> > in continuing.
>
> I tried to redo IDE part but discovered nasty design problem, more below.
>
> > > > or the other. If there's some part of the implementation you don't
> > > > like, then I suggest you show why. Since we appear to have reached a
> > >
> > > Damn, I showed it few times. You seem to contradict yourself.
> >
> > A few of the points. Your main argument on the pre/post flush business
> > makes zero sense still, and that seems to be the heart of your
> > 'crappola' argument.
> >
> > I already said that I can move the business of queueing post/pre flushes
> > into the block core instead. You seem to dislike the very way of using pre/post
> > flushes to provide barriers, and to that I can only say tough shit.
> > Unless you can pull a rabbit out of your hat and suggest something
> > better, then your 'crappola' argument holds absolutely no grounds
> > whatsoever. The pre/post flush approach has worked successfully, it's
> > been tested extensively, and it works. Your pipe dreams of absolutely no
> > substance need no further comments.
>
> It currently works this way:
>
> pre flush (whole disk cache) + write + post flush (whole disk cache)
>
> This is private to IDE code, higher layers do not know about it.
>
> write + flush (whole disk cache)
>
> Is sufficient because you can get the failed sector number and see if it belongs
> to your write request. Pre flush can't help you in any way with previous
> requests because they were already ACK-ed to higher layers.
>
> Please correct me if I'm missing something.

There are ordering constraints between the submitted write and
previously submitted writes. If there weren't we could use FUA (provided
it was widely supported, which it appears not to be yet).
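The ordering argument above can be illustrated with a toy model. This is only a minimal sketch: `toy_disk`, `toy_write`, `toy_flush` and `toy_barrier_write` are hypothetical names, not functions from the IDE code, and the model assumes a drive that acknowledges a write as soon as it reaches the volatile cache. It shows just one thing: the pre flush puts previously submitted writes on the media before the barrier write is issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SECTORS 16

/* Toy disk model (hypothetical, for illustration only). */
struct toy_disk {
    bool cached[MAX_SECTORS];   /* data sits in the volatile write cache */
    bool on_media[MAX_SECTORS]; /* data has reached the platter */
};

/* A cached write: acknowledged at once, but only held in the cache. */
static void toy_write(struct toy_disk *d, size_t sector)
{
    assert(sector < MAX_SECTORS);
    d->cached[sector] = true;
}

/* FLUSH CACHE: push every cached sector to the media. */
static void toy_flush(struct toy_disk *d)
{
    for (size_t i = 0; i < MAX_SECTORS; i++) {
        if (d->cached[i]) {
            d->on_media[i] = true;
            d->cached[i] = false;
        }
    }
}

/* Barrier write as described in the thread: pre flush + write + post flush. */
static void toy_barrier_write(struct toy_disk *d, size_t sector)
{
    toy_flush(d);         /* previously submitted writes reach the media first */
    toy_write(d, sector); /* the barrier write itself */
    toy_flush(d);         /* the barrier write reaches the media */
}
```

In this model, dropping the first `toy_flush()` call would leave the drive free to put the barrier sector on the media before older cached sectors, which is the ordering constraint being discussed.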

> > > > discussion dead lock, I suggest you do so by showing a patch changing
> > > > eg the ide_get_error_location() stuff. Sadly you could have done this
> > > > roughly 10 times in the same time frame that you have written these
> > > > emails.
> > >
> > > Are you trying to trick me into doing your task?
> >
> > I don't know why you keep thinking this is my job to complete this
> > project 100% on my own?! There's a general problem that needs solving,
> > and I would hope that others would be willing to help out where needed.
> > I would encourage people to help out if they care about the issue.
>
> Please note that barrier patches are a new feature not a bugfix as
> you can always disable write cache unless buggy firmware/disk but in
> this case you can't be sure if they don't lie about flushes too.
>
> Yes, that sucks for performance but you can instead get drives which
> expire their caches (most do?) and UPS (they are really cheap nowadays).
>
> ;-)

Heh, I'm not so sure anyone would agree this is a viable alternative :)

> I like the idea of flush barriers but I see more and more problems
> to do it sanely.

Yeah it's tricky...


> > I'm not going to comment further on your mails in this thread, unless
> > they have substantial technical comment. Your 'crap' arguments so far
> > have been largely unsubstantiated, and as such they don't accomplish
> > much except waste time.
>
> OK. I tried to rewrite IDE part and discovered this:
>
> +int ide_end_request (ide_drive_t *drive, int uptodate, int nr_sectors)
> +{
> +	struct request *rq;
> +	unsigned long flags;
> +	int ret = 1;
> +
> +	spin_lock_irqsave(&ide_lock, flags);
> +	rq = HWGROUP(drive)->rq;
> +
> +	if (!nr_sectors)
> +		nr_sectors = rq->hard_cur_sectors;
> +
> +	if (!blk_barrier_rq(rq))
> +		ret = __ide_end_request(drive, rq, uptodate, nr_sectors);
> +	else {
>
> It seems that __ide_end_request() and thus end_that_request_first()
> is called only once for the real_rq request - it breaks partial completions
> which are done by the IDE PIO code (-> it breaks IDE PIO).

There is a bug there it seems, but not the code you paste above. Right
now it just accounts the sectors processed, we need to move the data
pointers too of course.

> I don't see an easy way to fix it because if we do partial completions
> we'll ACK some bios to higher layers before doing flush.

Should be able to do it with your ->cbio additions.

> Fixing IDE not to do partial completions is also not easy
> (I'm doing it slowly).
>
> +		struct request *flush_rq = &HWGROUP(drive)->wrq;
> +
> +		flush_rq->nr_sectors -= nr_sectors;
> +		if (!flush_rq->nr_sectors) {
> +			ide_queue_flush_cmd(drive, rq, 1);
> +			ret = 0;
> +		}
> +	}
>
> BTW are you aware of two (minor?) corner cases of the current implementation?
>
> - you can't have journal on a separate device
> (pre and post flushes will only flush device storing journal not data)

You can, but obviously not sending a barrier bio down the pipe since
that has a specific target on its back. You could use
blkdev_issue_flush() instead. But it's not optimal.

> - if you have more than 1 filesystem on the disk (quite likely scenario) it
> can happen that barrier (flush) will fail for sector for file from the
> other fs and later barrier for this other fs will succeed

I don't think so. You are pinning the disk for this operation, so you
know that no one will come in-between your pre/write/post cycle. It can
happen that the flush gets an error for another location on disk,
ide_complete_barrier() needs to see if that sector is in range (and
report it), then ideally issue a new flush until no errors occur. And
now move on to the barrier.
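The retry logic described above could look roughly like the following. This is a sketch under stated assumptions, not the real ide_complete_barrier(): `sector_range`, `flush_fn`, `flush_until_clean` and the `fake_drive` test double are all hypothetical names introduced here. The flush callback is assumed to return 0 on success, or -1 with the failing sector number filled in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sectors covered by the barrier write (hypothetical helper type). */
struct sector_range {
    uint64_t start;
    uint64_t nr_sectors;
};

static bool sector_in_range(const struct sector_range *r, uint64_t sector)
{
    return sector >= r->start && sector - r->start < r->nr_sectors;
}

/* Assumed flush hook: 0 on success, -1 with *failed_sector set on error. */
typedef int (*flush_fn)(void *ctx, uint64_t *failed_sector);

/*
 * Reissue the flush until it completes without error.
 * Returns 0 if the barrier's own sectors never failed, -1 if one of them
 * did, and -2 if max_tries flushes all reported errors.
 * *foreign_errors counts failures outside the barrier's range.
 */
static int flush_until_clean(flush_fn flush, void *ctx,
                             const struct sector_range *barrier,
                             unsigned max_tries, unsigned *foreign_errors)
{
    bool barrier_failed = false;

    assert(flush && barrier && foreign_errors);
    *foreign_errors = 0;

    for (unsigned i = 0; i < max_tries; i++) {
        uint64_t failed;

        if (flush(ctx, &failed) == 0)
            return barrier_failed ? -1 : 0;
        if (sector_in_range(barrier, failed))
            barrier_failed = true;   /* this error belongs to the barrier */
        else
            (*foreign_errors)++;     /* some other writer's sector */
    }
    return -2;
}

/* Fake drive for demonstration: each flush reports one pending bad sector. */
struct fake_drive {
    uint64_t bad[4];
    unsigned nr_bad;
};

static int fake_flush(void *ctx, uint64_t *failed_sector)
{
    struct fake_drive *d = ctx;

    if (d->nr_bad == 0)
        return 0;
    *failed_sector = d->bad[--d->nr_bad];
    return -1;
}
```

An error outside the barrier's range is counted but does not fail the barrier, which matches the point that a failure belonging to another writer should not be blamed on this request.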

> [ If you see any mistakes in my comments please correct them.
> I tried to be as accurate as possible. ]

Mail was fine, and great comments! Thanks for looking at this, I hope to
have more time to do so myself really soon.

--
Jens Axboe

2004-06-10 06:28:07

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Thu, Jun 10 2004, Bartlomiej Zolnierkiewicz wrote:
> > > - if you have more than 1 filesystem on the disk (quite likely scenario) it
> > > can happen that barrier (flush) will fail for sector for file from the
> > > other fs and later barrier for this other fs will succeed
> >
> > I don't understand this one.
>
> Flush command can fail for sector which came into disk's write cache
> from some write request for some other fs on the same disk i.e.
>
> write requests for fs 'a' (sector 'x' stays in write cache)
> write requests for fs 'b'
> commit log for fs 'b' -> barrier for fs 'b'
> barrier fails because of sector 'x'
> commit log for fs 'a' -> barrier for fs 'a'
> barrier succeeds

That's a bug in ide_complete_barrier(), like I outlined in the previous
mail you need to reissue the flush until no errors occur. A pre-flush
should not fail the barrier of course, since it has no relation to it.

> Such scenario is highly unlikely (disks do bad sector re-allocation
> on write) but not impossible (pool of sectors for remapping is not unlimited).
> That's why I think it is a minor issue (but still worth knowing about).

Yes, very. It needs to work though...

--
Jens Axboe

2004-06-10 06:29:42

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Wed, Jun 09 2004, Andrew Morton wrote:
> Ed Tomlinson <[email protected]> wrote:
> >
> > Hi,
> >
> > I am still seeing these with 7-rc3-mm1... No extra diag info either. It would be
> > really nice to see this one fixed.
>
> So ide-print-failed-opcode.patch isn't working. Presumably
> HWGROUP(drive)->rq is null.

No, I just put the code in the wrong ->error() location since ide has
dupes of this sprinkled...

I'll get you one for ide-disk that works. It could be handy in the
future as well, it's always annoyed me that ide errors out without telling
you what command failed.

--
Jens Axboe

2004-06-10 15:13:39

by Chris Mason

Subject: Re: ide errors in 7-rc1-mm1 and later

On Wed, 2004-06-09 at 20:38, Andrew Morton wrote:
> Chris Mason <[email protected]> wrote:
> >
> > On Wed, 2004-06-09 at 19:50, Andrew Morton wrote:
> > > Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> > > >
> > > > Does the journal have a checksum or some other protection against failure during
> > > > writing journal to a disk? If not then it still can be screwed even with
> > > > ordered writes if we are unfortunate enough. ;-)
> > >
> > > A transaction is written to disk as two synchronous operations: write all
> > > the data, wait on it, write the single commit block, wait on that.
> > >
> > > If the commit block were to hit disk before the data then we have a window
> > > in which poweroff+recovery would replay garbage into the filesystem.
> > >
> > > So I think we have a bug in the current ext3 barrier implementation - we
> > > need a blk_issue_flush() before submitting the buffer_ordered commit block.
> >
> > The IDE barriers are both a pre and post flush. If the commit block is
> > ordered, before the commit block hits the disk we know all the blocks
> > previously submitted are also on disk.
> >
>
> Oh, OK. Will the same apply to (for example) scsi?

For scsi the general expectation is that write cache will be off unless
it is battery backed. blkdev_issue_flush does go down to scsi, but I'm
not sure about the regular WRITE_BARRIER stuff. Jens?

It's true that we need an extra step for external journals in both ext3
and reiser. We need extra flushes for O_SYNC and O_DIRECT as well, I
wanted to get the core basics working and API fixed before we start sprinkling
flushes all over the kernel for complete coverage.

I just did some benchmarking of the two BH_Eopnotsupp patches I sent,
and for synctest -t 20 -f -n 1 dir, there's not enough difference
between barriers on and off for ext3. (1-2% at most). It doesn't look
like ext3_sync_file is triggering commits all the time, I think we need
extra flushes there too.

Andrew, both O_SYNC and ext3 fsync rely on inode->i_state & I_DIRTY to
decide when to call write_inode(wait = 1). What happens when a
background writeout clears I_DIRTY without triggering the commit? Looks
like we won't wait on the transaction to complete in this case.

-chris


2004-06-10 15:16:06

by Jens Axboe

Subject: Re: ide errors in 7-rc1-mm1 and later

On Thu, Jun 10 2004, Chris Mason wrote:
> On Wed, 2004-06-09 at 20:38, Andrew Morton wrote:
> > Chris Mason <[email protected]> wrote:
> > >
> > > On Wed, 2004-06-09 at 19:50, Andrew Morton wrote:
> > > > Bartlomiej Zolnierkiewicz <[email protected]> wrote:
> > > > >
> > > > > > Does the journal have a checksum or some other protection against failure during
> > > > > > writing journal to a disk? If not then it still can be screwed even with
> > > > > ordered writes if we are unfortunate enough. ;-)
> > > >
> > > > A transaction is written to disk as two synchronous operations: write all
> > > > the data, wait on it, write the single commit block, wait on that.
> > > >
> > > > If the commit block were to hit disk before the data then we have a window
> > > > in which poweroff+recovery would replay garbage into the filesystem.
> > > >
> > > > So I think we have a bug in the current ext3 barrier implementation - we
> > > > need a blk_issue_flush() before submitting the buffer_ordered commit block.
> > >
> > > The IDE barriers are both a pre and post flush. If the commit block is
> > > ordered, before the commit block hits the disk we know all the blocks
> > > previously submitted are also on disk.
> > >
> >
> > Oh, OK. Will the same apply to (for example) scsi?
>
> For scsi the general expectation is that write cache will be off unless
> it is battery backed. blkdev_issue_flush does go down to scsi, but I'm
> not sure about the regular WRITE_BARRIER stuff. Jens?

That's right, blkdev_issue_flush() works but barriers don't yet. The
usual error handling story.

--
Jens Axboe

2004-06-10 16:39:13

by Eric D. Mudama

Subject: Re: ide errors in 7-rc1-mm1 and later

On Thu, Jun 10 at 8:11, Jens Axboe wrote:
>On Thu, Jun 10 2004, Bartlomiej Zolnierkiewicz wrote:
>>
>> /me just thinks loudly
>>
>> 'linear range' FLUSH CACHE seems so easy to implement that I always wondered
>> why FLUSH CACHE command doesn't make any use of LBA address and number
>> of sectors.
>
>Indeed, that would be very helpful as well.

Neat idea... so you send us a LBA and a block count, and we return
good status if that region is flushed.

Each command can specify a 32MiB region, assuming a device with
512-byte LBAs.

Propose an exact implementation and an opcode...
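The 32MiB figure follows from a 16-bit sector count, where a value of 0000h means 65,536 sectors: 65,536 x 512 bytes = 32 MiB. A small sketch of that arithmetic, with helper names (`flush_range_bytes`, `flush_cmds_needed`) that are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Largest count a 16-bit sector count register can express (0000h == 65,536). */
#define LBA48_MAX_SECTORS 65536u

/* Bytes covered by one ranged flush of nr_sectors sectors. */
static uint64_t flush_range_bytes(uint32_t nr_sectors, uint32_t sector_size)
{
    assert(nr_sectors >= 1 && nr_sectors <= LBA48_MAX_SECTORS);
    return (uint64_t)nr_sectors * sector_size;
}

/* Number of ranged flush commands needed to cover nr_sectors sectors. */
static uint64_t flush_cmds_needed(uint64_t nr_sectors)
{
    return (nr_sectors + LBA48_MAX_SECTORS - 1) / LBA48_MAX_SECTORS;
}
```

A region larger than 32 MiB would therefore need more than one command, or a fall back to a whole-cache flush.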

--
Eric D. Mudama
[email protected]

2004-06-10 17:51:07

by Jeff Garzik

Subject: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

Eric D. Mudama wrote:
> On Thu, Jun 10 at 8:11, Jens Axboe wrote:
>
>> On Thu, Jun 10 2004, Bartlomiej Zolnierkiewicz wrote:
>>
>>>
>>> /me just thinks loudly
>>>
>>> 'linear range' FLUSH CACHE seems so easy to implement that I always
>>> wondered
>>> why FLUSH CACHE command doesn't make any use of LBA address and number
>>> of sectors.
>>
>>
>> Indeed, that would be very helpful as well.
>
>
> Neat idea... so you send us a LBA and a block count, and we return
> good status if that region is flushed.
>
> Each command can specify a 32MiB region, assuming a device with
> 512-byte LBAs.
>
> Propose an exact implementation and an opcode...


Ok, I'll give it a shot:

1) IDENTIFY DEVICE, Word 206, Command set/feature supported

bit 15: shall be cleared to zero
bit 14: shall be set to one
bits 13:1: reserved
bit 0: 1 == flush cache (range) supported

Word 206:

If bit 0 is set to one, the mandatory FLUSH CACHE and FLUSH CACHE EXT
commands (if implemented) support the RANGE bit, and user-supplied LBA
and sector count specifying the limits of the cache flush. This bit
merely identifies the presence of this feature. Use word 207, bit 0, to
determine if the feature is enabled.


2) IDENTIFY DEVICE, Word 207, Command set/feature enabled

bit 15: shall be cleared to zero
bit 14: shall be set to one
bits 13:1: reserved
bit 0: 1 == flush cache (range) enabled

Word 206:

If bit 0 is set to one, the mandatory FLUSH CACHE and FLUSH CACHE EXT
commands (if implemented) support the RANGE bit, and user-supplied LBA
and sector count specifying the limits of the cache flush.
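A host-side check of the two proposed IDENTIFY DEVICE words might look like this. It is a sketch only: words 206 and 207 carry these meanings solely in this proposal, the function names are hypothetical, and the identify data is assumed to be already converted to host-endian 16-bit words.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ID_WORDS                 256
#define ID_FLUSH_RANGE_SUPPORTED 206 /* proposed word, per item 1 */
#define ID_FLUSH_RANGE_ENABLED   207 /* proposed word, per item 2 */

/* The word is only meaningful if bit 15 is zero and bit 14 is one. */
static bool id_word_valid(uint16_t w)
{
    return (w & 0xC000) == 0x4000;
}

/* Bit 0 of word 206: FLUSH CACHE (range) is implemented. */
static bool flush_range_supported(const uint16_t *id)
{
    assert(id);
    return id_word_valid(id[ID_FLUSH_RANGE_SUPPORTED]) &&
           (id[ID_FLUSH_RANGE_SUPPORTED] & 1);
}

/* Bit 0 of word 207: the feature is also enabled. */
static bool flush_range_enabled(const uint16_t *id)
{
    return flush_range_supported(id) &&
           id_word_valid(id[ID_FLUSH_RANGE_ENABLED]) &&
           (id[ID_FLUSH_RANGE_ENABLED] & 1);
}
```

Treating an invalid signature (bits 15:14 not equal to 01b) as "not supported" keeps older drives, which leave these words at zero or all ones, on the whole-cache flush path.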



3) Modify FLUSH CACHE (E7h) as follows:

Inputs:
-------

Features: bit 0 (RANGE)
Sector Count: sector count
LBA Low: LBA(7:0)
LBA Mid: LBA(15:8)
LBA High: LBA(23:16)

Features -
If the RANGE bit is set, the cache flush operation shall be considered
to be limited to the region specified in Sector Count / LBA registers.
If the RANGE bit is not set, or the implementation does not support the
RANGE bit, then the FLUSH CACHE operation shall flush the entire cache.

Sector Count -
Maximum number of sectors to be flushed from the cache. A value of 00h
specifies that 256 sectors are to be flushed.

LBA Low / Mid / High -
An LBA starting address for the flush. Register contents as specified
in READ DMA command, and other commands.


Normal outputs:
---------------
Unchanged.


Error outputs:
--------------
Error register -

RANGE (bit 0) shall be set to one, if RANGE bit was specified in
Features register when the command was submitted, indicating this was a
range-based flush cache.


Description
-----------
If RANGE bit is set to one, the flush cache operation at a minimum shall
flush the specified range of sectors specified by LBA / Sector Count.
An implementation may choose to flush more than the specified range, up
to an entire cache flush in a compatible or "no op" implementation.

If no data within the specified LBA range exists in cache to be flushed,
that shall not be considered an error.


4) Modify FLUSH CACHE EXT (EAh) as follows:

Inputs:
-------

Features:
  curr  bit 0 (RANGE)
  prev  na
Sector Count:
  curr  sector count(7:0)
  prev  sector count(15:8)
LBA Low:
  curr  LBA(7:0)
  prev  LBA(31:24)
LBA Mid:
  curr  LBA(15:8)
  prev  LBA(39:32)
LBA High:
  curr  LBA(23:16)
  prev  LBA(47:40)

Features -
If the RANGE bit is set, the cache flush operation shall be considered
to be limited to the region specified in Sector Count / LBA registers.
If the RANGE bit is not set, or the implementation does not support the
RANGE bit, then the FLUSH CACHE operation shall flush the entire cache.

Sector Count -
Maximum number of sectors to be flushed from the cache. A value of
0000h specifies that 65,536 sectors are to be flushed.

LBA Low / Mid / High -
An LBA starting address for the flush. Register contents as specified
in READ DMA command, and other commands.


Normal outputs:
---------------
Unchanged.


Error outputs:
--------------
Error register -

RANGE (bit 0) shall be set to one, if RANGE bit was specified in
Features register when the command was submitted, indicating this was a
range-based flush cache.


Description
-----------
If RANGE bit is set to one, the flush cache operation at a minimum shall
flush the specified range of sectors specified by LBA / Sector Count.
An implementation may choose to flush more than the specified range, up
to an entire cache flush in a compatible or "no op" implementation.

If no data within the specified LBA range exists in cache to be flushed,
that shall not be considered an error.


</proposal>
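To make the register layout of item 4 concrete, here is a sketch of how a host might fill in the current/previous register values for a ranged FLUSH CACHE EXT. The `lba48_taskfile` structure and `build_flush_range_ext()` are hypothetical names for illustration, and the RANGE feature bit exists only in this proposal.

```c
#include <assert.h>
#include <stdint.h>

#define ATA_CMD_FLUSH_CACHE_EXT 0xEA
#define FLUSH_FEATURE_RANGE     0x01 /* proposed bit, not in any ratified spec */

/* "curr" and "prev" (hob_*) contents of each 8-bit register. */
struct lba48_taskfile {
    uint8_t command;
    uint8_t feature, hob_feature;
    uint8_t nsect,   hob_nsect;
    uint8_t lbal,    hob_lbal;
    uint8_t lbam,    hob_lbam;
    uint8_t lbah,    hob_lbah;
};

static void build_flush_range_ext(struct lba48_taskfile *tf,
                                  uint64_t lba, uint32_t nr_sectors)
{
    assert(tf);
    assert(lba < (1ULL << 48));
    assert(nr_sectors >= 1 && nr_sectors <= 65536);

    tf->command     = ATA_CMD_FLUSH_CACHE_EXT;
    tf->feature     = FLUSH_FEATURE_RANGE;
    tf->hob_feature = 0;

    /* 65,536 sectors wraps to 0000h, as the Sector Count text specifies. */
    tf->nsect     = nr_sectors & 0xff;
    tf->hob_nsect = (nr_sectors >> 8) & 0xff;

    tf->lbal     = lba & 0xff;          /* LBA(7:0)   */
    tf->lbam     = (lba >> 8) & 0xff;   /* LBA(15:8)  */
    tf->lbah     = (lba >> 16) & 0xff;  /* LBA(23:16) */
    tf->hob_lbal = (lba >> 24) & 0xff;  /* LBA(31:24) */
    tf->hob_lbam = (lba >> 32) & 0xff;  /* LBA(39:32) */
    tf->hob_lbah = (lba >> 40) & 0xff;  /* LBA(47:40) */
}
```

A caller would presumably issue this only after confirming that the drive advertises the feature, and fall back to a plain FLUSH CACHE EXT otherwise.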


Comments requested.

We need to KISS, if this proposal has any hope getting accepted.

People interested in filesystem journalling, barriers and such please
review.

Once people on lkml are happy, I'll write this up in a PDF in T13 form,
and propose it on the T13 list (making sure everyone involved is
properly credited, of course).

Jeff


2004-06-10 18:02:55

by Jeff Garzik

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

Oh, also:

We'll need to write up precisely _why_ this is used, and give some
examples of usage, for people reading the proposal (mostly T13-ish
people) who have not been following the lkml barrier discussion closely.

Jeff



2004-06-10 20:30:56

by Eric D. Mudama

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

On Thu, Jun 10 at 14:02, Jeff Garzik wrote:
>Oh, also:
>
>We'll need to write up precisely _why_ this is used, and give some
>examples of usage, for people reading the proposal (mostly T13-ish
>people) who have not been following the lkml barrier discussion closely.

One comment...

There will need to be queued versions of this command, both legacy
and first-party, since a flush cache command will abort an outstanding
queue with error.

Second, I'm trying to figure out exactly how this might be used...

Would the driver just send down alternating write/flushregion commands
queued? If that is the case, the drive will offer 2x the queue depth
(maybe 30% more performance) doing pure WRITE DMA QUEUED FUA (FP)
commands, wouldn't it? Then again, for a metadata-only journaling
system, this would give you almost 100% of raw performance, with
metadata reliability which means you could always boot the drive.

I'm not sure what percentage of the writes to the filesystem one might
envision doing with this system...

--eric




--
Eric D. Mudama
[email protected]

2004-06-11 06:09:12

by Stuart Young

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

On Fri, 11 Jun 2004 03:50, Jeff Garzik wrote:

> 2) IDENTIFY DEVICE, Word 207, Command set/feature enabled
>
> bit 15: shall be cleared to zero
> bit 14: shall be set to one
> bits 13:1: reserved
> bit 0: 1 == flush cache (range) enabled
>
> Word 206:
>
> If bit 0 is set to one, the mandatory FLUSH CACHE and FLUSH CACHE EXT
> commands (if implemented) support the RANGE bit, and user-supplied LBA
> and sector count specifying the limits of the cache flush.

Shouldn't that be "Word 207:" up there?

--
Stuart Young (aka Cef)
[email protected] is for LKML and related email only

2004-06-11 08:11:52

by Jens Axboe

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

On Thu, Jun 10 2004, Jeff Garzik wrote:
> Oh, also:
>
> We'll need to write up precisely _why_ this is used, and give some
> examples of usage, for people reading the proposal (mostly T13-ish
> people) who have not been following the lkml barrier discussion closely.

Proposal looks fine, but please lets not forget that flush cache range
is really a band-aid because we don't have a proper ordered write in the
first place. Personally, I'd much rather see that implemented than flush
cache range. It would be way more effective.

--
Jens Axboe

2004-06-11 16:14:09

by Eric D. Mudama

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

On Fri, Jun 11 at 9:55, Jens Axboe wrote:
>Proposal looks fine, but please lets not forget that flush cache range
>is really a band-aid because we don't have a proper ordered write in the
>first place. Personally, I'd much rather see that implemented than flush
>cache range. It would be way more effective.

So something like:

WRITE FIRST PARTY DMA QUEUED BARRIER EXT
READ FIRST PARTY DMA QUEUED BARRIER EXT
READ DMA QUEUED BARRIER EXT
READ DMA QUEUED BARRIER
WRITE DMA QUEUED BARRIER
WRITE DMA QUEUED BARRIER EXT


...

If the drive receives a queued barrier write (NCQ or Legacy), it will
finish processing all previously-received queued commands and post
good status for them, then it will process the barrier operation, post
status for that barrier operation, then it will continue processing
queued commands in the order received.

Multiple barrier operations can be in the queue at the same time. A
barrier operation has an implied FUA associated with it, such that the
command (and all previous-in-time commands) must be pushed to the
media before command completion can be indicated.


Is that what would be most useful?

--eric



--
Eric D. Mudama
[email protected]

2004-06-11 16:30:09

by Jeff Garzik

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

Eric D. Mudama wrote:
> On Thu, Jun 10 at 14:02, Jeff Garzik wrote:
>
>> Oh, also:
>>
>> We'll need to write up precisely _why_ this is used, and give some
>> examples of usage, for people reading the proposal (mostly T13-ish
>> people) who have not been following the lkml barrier discussion closely.
>
>
> One comment...
>
> There will need to be queued versions of this command, both legacy
> and first-party, since a flush cache command will abort an outstanding
> queue with error.
>
> Second, I'm trying to figure out exactly how this might be used...


With flush cache range, it should perform exactly the same as a
traditional flush cache with regards to queueing-related issues.

ATA drives normally used without queueing would be a target for this
type of proposal. Hopefully lower-end devices would not have trouble
implementing flush cache (range).

Moving forward, in ATA TCQ is largely a transitory step to NCQ.

As such, Linux will probably have two methods of implementing barriers,
the "pre-NCQ" method, and the NCQ method.

The pre-NCQ method most certainly involves heavy flush cache use.

The NCQ method, I presume, would involve almost exclusive use of queued
FUA commands. That way the OS knows precisely what has not yet hit the
platter, on all data writes to disk. It needs only to wait for queue
completion before submitting the commit block (sector), also FUA.
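The queued-FUA scheme outlined above can be sketched as simple host-side bookkeeping. This is an illustrative model only, with hypothetical names (`fua_queue`, `fua_submit`, `fua_complete`, `fua_try_commit`); it assumes that completion of a FUA write means the data is on the platter.

```c
#include <assert.h>
#include <stdbool.h>

/* Host-side view of the drive's queue when every write is FUA. */
struct fua_queue {
    unsigned outstanding;   /* queued FUA writes not yet completed */
    unsigned on_media;      /* completed FUA writes, known to be on the platter */
    bool commit_written;    /* commit block has been issued */
};

/* Queue one FUA data write. */
static void fua_submit(struct fua_queue *q)
{
    q->outstanding++;
}

/* The drive reported completion of one FUA write. */
static void fua_complete(struct fua_queue *q)
{
    assert(q->outstanding > 0);
    q->outstanding--;
    q->on_media++;
}

/* The commit block may only be issued once the queue has drained. */
static bool fua_try_commit(struct fua_queue *q)
{
    if (q->outstanding > 0)
        return false;
    q->commit_written = true;
    return true;
}
```

The cost of this approach is visible in `fua_try_commit()`: the host has to let the queue run empty before every commit block.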

Jeff


2004-06-11 16:33:47

by Jeff Garzik

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

Jens Axboe wrote:
> On Thu, Jun 10 2004, Jeff Garzik wrote:
>
>>Oh, also:
>>
>>We'll need to write up precisely _why_ this is used, and give some
>>examples of usage, for people reading the proposal (mostly T13-ish
>>people) who have not been following the lkml barrier discussion closely.
>
>
> Proposal looks fine, but please lets not forget that flush cache range
> is really a band-aid because we don't have a proper ordered write in the
> first place. Personally, I'd much rather see that implemented than flush
> cache range. It would be way more effective.


Certainly agreed, and that was the gist of the reply just sent to Eric:
moving forward, implementing barriers should be done with new "NCQ"
commands and FUA, or something along those lines.

New drives will continue to come out that aren't in the NCQ class for a
while yet, though.

Jeff


2004-06-11 16:37:24

by Jeff Garzik

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

Eric D. Mudama wrote:
> On Fri, Jun 11 at 9:55, Jens Axboe wrote:
>
>> Proposal looks fine, but please lets not forget that flush cache range
>> is really a band-aid because we don't have a proper ordered write in the
>> first place. Personally, I'd much rather see that implemented than flush
>> cache range. It would be way more effective.
>
>
> So something like:
>
> WRITE FIRST PARTY DMA QUEUED BARRIER EXT
> READ FIRST PARTY DMA QUEUED BARRIER EXT
> READ DMA QUEUED BARRIER EXT
> READ DMA QUEUED BARRIER
> WRITE DMA QUEUED BARRIER
> WRITE DMA QUEUED BARRIER EXT

Honestly, Linux at least isn't going to care about "legacy TCQ" at all,
unless in the very rare case that the controller implements TCQ support
in hardware.

The overall difficulty with implementing atomic updates, journalling,
barriers etc. on ATA is that traditionally the OS had no clue what was
in the write cache, and what was actually on the platter.

Thus, I think that an FPDMA queued FUA read/write should be all that's
needed, since that automatically gives the OS the knowledge of ordering,
which gives barriers what they need. Ordering need only be a matter of
waiting for the hardware queue (all FUA commands) to drain, and then
issuing an FUA commit block.

Unfortunately, that's not the answer drive guys want to hear, because
FUA limits the optimization potential from previous ATA. ;-) Maybe
drive performance is high enough these days that queued-FUA as a
standard mode of operation is tolerable...


> ...
>
> If the drive receives a queued barrier write (NCQ or Legacy), it will
> finish processing all previously-received queued commands and post
> good status for them, then it will process the barrier operation, post
> status for that barrier operation, then it will continue processing
> queued commands in the order received.

If queued-FUA is out of the question, this seems quite reasonable. It
appears to achieve the commit-block semantics described for barrier
operation, AFAICS.

Jeff


2004-06-11 16:52:21

by Eric D. Mudama

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

On Fri, Jun 11 at 12:31, Jeff Garzik wrote:
>If queued-FUA is out of the question, this seems quite reasonable. It
>appears to achieve the commit-block semantics described for barrier
>operation, AFAICS.

Queued FUA shouldn't be out of the question.

However, Queued FUA requires waiting for the queue to drain before
sending more commands, since a pair of queued FUA commands doesn't
guarantee the ordering of those two commands, which may or may not be
acceptable semantics.

The barrier operation is basically a queueing-friendly flush+FUA,
which may be better... it lets the driver keep the queue in the drive
full, and also allows writes other than the commit block to not be
done as FUA operations, which is potentially faster. The bigger the
ratio of data to commit block, the better the performance would be
with a barrier operation vs a purely queued FUA workload.

--eric

--
Eric D. Mudama
[email protected]

2004-06-11 16:53:18

by Jens Axboe

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

On Fri, Jun 11 2004, Eric D. Mudama wrote:
> On Fri, Jun 11 at 9:55, Jens Axboe wrote:
> >Proposal looks fine, but please lets not forget that flush cache range
> >is really a band-aid because we don't have a proper ordered write in the
> >first place. Personally, I'd much rather see that implemented than flush
> >cache range. It would be way more effective.
>
> So something like:
>
> WRITE FIRST PARTY DMA QUEUED BARRIER EXT
> READ FIRST PARTY DMA QUEUED BARRIER EXT
> READ DMA QUEUED BARRIER EXT
> READ DMA QUEUED BARRIER
> WRITE DMA QUEUED BARRIER
> WRITE DMA QUEUED BARRIER EXT
>
>
> ...
>
> If the drive receives a queued barrier write (NCQ or Legacy), it will
> finish processing all previously-received queued commands and post
> good status for them, then it will process the barrier operation, post
> status for that barrier operation, then it will continue processing
> queued commands in the order received.
>
> Multiple barrier operations can be in the queue at the same time. A
> barrier operation has an implied FUA associated with it, such that the
> command (and all previous-in-time commands) must be pushed to the
> media before command completion can be indicated.
>
>
> Is that what would be most useful?

That is _spot on_ the best implementation for writes and what I have
asked for all along :-). I have nothing to add to the above.

I don't have an immediate use for the read-barrier requests (btw, I
think we should call it WRITE_DMA_QUEUED_ORDERED and so forth, clearer
naming), though.

--
Jens Axboe

2004-06-11 16:56:47

by Jens Axboe

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

On Fri, Jun 11 2004, Jeff Garzik wrote:
> Unfortunately, that's not the answer drive guys want to hear, because
> FUA limits the optimization potential from previous ATA. ;-) Maybe
> drive performance is high enough these days that queued-FUA as a
> standard mode of operation is tolerable...

Data integrity doesn't come for free. Take a pick :-)

> >If the drive receives a queued barrier write (NCQ or Legacy), it will
> >finish processing all previously-received queued commands and post
> >good status for them, then it will process the barrier operation, post
> >status for that barrier operation, then it will continue processing
> >queued commands in the order received.
>
> If queued-FUA is out of the question, this seems quite reasonable. It
> appears to achieve the commit-block semantics described for barrier
> operation, AFAICS.

Actually from Linux's point of view, drive may reorder previously
committed requests - just not around the barrier.
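The relaxed rule stated here (reordering is fine, except around a barrier) can be written down as a checker over a completion trace. This is a sketch with a hypothetical function name; commands are identified by their position in submission order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * is_barrier[i] marks command i (in submission order) as a barrier.
 * completion[k] is the submission index of the k-th command to complete.
 * Returns true if no command was reordered around a barrier.
 */
static bool completion_order_ok(const bool *is_barrier,
                                const size_t *completion, size_t n)
{
    for (size_t a = 0; a < n; a++) {
        for (size_t b = a + 1; b < n; b++) {
            size_t first = completion[a];   /* completed earlier */
            size_t second = completion[b];  /* completed later */

            assert(first < n && second < n);
            if (first < second)
                continue;                   /* same as submission order */

            /*
             * 'first' was submitted after 'second' yet completed before it.
             * That is only legal if no barrier lies in [second, first].
             */
            for (size_t i = second; i <= first; i++)
                if (is_barrier[i])
                    return false;
        }
    }
    return true;
}
```

With commands 0..3 and command 2 as the barrier, completing in the order 1, 0, 2, 3 is accepted, while 0, 2, 1, 3 and 0, 1, 3, 2 are rejected.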

--
Jens Axboe

2004-06-11 17:07:07

by Jens Axboe

Subject: Re: flush cache range proposal (was Re: ide errors in 7-rc1-mm1 and later)

On Fri, Jun 11 2004, Eric D. Mudama wrote:
> On Fri, Jun 11 at 12:31, Jeff Garzik wrote:
> >If queued-FUA is out of the question, this seems quite reasonable. It
> >appears to achieve the commit-block semantics described for barrier
> >operation, AFAICS.
>
> Queued FUA shouldn't be out of the question.
>
> However, Queued FUA requires waiting for the queue to drain before
> sending more commands, since a pair of queued FUA commands doesn't
> guarantee the ordering of those two commands, which may or may not be
> acceptable semantics.

You can continue building and reordering requests behind the QUEUED_FUA
write(s).

> The barrier operation is basically a queueing-friendly flush+FUA,
> which may be better... it lets the driver keep the queue in the drive

That's exactly correct.

> full, and also allows writes other than the commit block to not be
> done as FUA operations, which is potentially faster. The bigger the
> ratio of data to commit block, the better the performance would be
> with a barrier operation vs a purely queued FUA workload.

Just looking at how pre/write/post flush performs and I don't think it
will be that bad (it's already quite good). Depends on how sync
intensive the workload is of course.

But as long as it's the fastest possible implementation (and I think it
is), then arguing about performance is futile imo. Correctness comes
first.

--
Jens Axboe

2004-06-14 21:45:26

by Ed Tomlinson

Subject: Re: ide errors in 7-rc1-mm1 and later

Hi,

Still get the errors with 7-rc3-mm2. Will this be fixed anytime soon?

TIA
Ed

> I am still seeing these with 7-rc3-mm1... No extra diag info either. It would be
> really nice to see this one fixed.
>
> TIA
> Ed Tomlinson
>
> On June 3, 2004 10:07 pm, Ed Tomlinson wrote:
> > I am still getting these ide errors with 7-rc2-mm2. I get the errors even
> > if I mount with barrier=0 (or just defaults). It would seem that something is
> > sending my drive commands it does not understand...
> >
> > May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> >
> > How can we find out what is wrong?
> >
> > This does not seem to be an error that corrupts the fs, it just slows things
> > down when it hits a group of these. Note that they keep popping up - they
> > do not stop (I still get them hours after booting).
> >
> > TIA
> > Ed Tomlinson
> >
> > ----------------------
> > 7-mm4 ok
> > 7-mm5 na
> > 7-rc1-mm1 errors
> > 7-rc2 ok
> > 7-rc2-mm2 errors
> >
> > CONFIG_IDE=y
> > CONFIG_BLK_DEV_IDE=y
> >
> > #
> > # Please see Documentation/ide.txt for help/info on IDE drives
> > #
> > # CONFIG_BLK_DEV_HD_IDE is not set
> > CONFIG_BLK_DEV_IDEDISK=y
> > CONFIG_IDEDISK_MULTI_MODE=y
> > # CONFIG_IDEDISK_STROKE is not set
> > CONFIG_BLK_DEV_IDECD=m
> > CONFIG_BLK_DEV_IDETAPE=m
> > # CONFIG_BLK_DEV_IDEFLOPPY is not set
> > CONFIG_BLK_DEV_IDESCSI=m
> > # CONFIG_IDE_TASK_IOCTL is not set
> > CONFIG_IDE_TASKFILE_IO=y
> >
> > #
> > # IDE chipset support/bugfixes
> > #
> > CONFIG_IDE_GENERIC=y
> > # CONFIG_BLK_DEV_CMD640 is not set
> > CONFIG_BLK_DEV_IDEPNP=y
> > CONFIG_BLK_DEV_IDEPCI=y
> > CONFIG_IDEPCI_SHARE_IRQ=y
> > # CONFIG_BLK_DEV_OFFBOARD is not set
> > # CONFIG_BLK_DEV_GENERIC is not set
> > # CONFIG_BLK_DEV_OPTI621 is not set
> > # CONFIG_BLK_DEV_RZ1000 is not set
> > CONFIG_BLK_DEV_IDEDMA_PCI=y
> > # CONFIG_BLK_DEV_IDEDMA_FORCED is not set
> > CONFIG_IDEDMA_PCI_AUTO=y
> > # CONFIG_IDEDMA_ONLYDISK is not set
> > CONFIG_BLK_DEV_ADMA=y
> > # CONFIG_BLK_DEV_AEC62XX is not set
> > # CONFIG_BLK_DEV_ALI15X3 is not set
> > # CONFIG_BLK_DEV_AMD74XX is not set
> > # CONFIG_BLK_DEV_ATIIXP is not set
> > # CONFIG_BLK_DEV_CMD64X is not set
> > # CONFIG_BLK_DEV_TRIFLEX is not set
> > # CONFIG_BLK_DEV_CY82C693 is not set
> > # CONFIG_BLK_DEV_CS5520 is not set
> > # CONFIG_BLK_DEV_CS5530 is not set
> > # CONFIG_BLK_DEV_HPT34X is not set
> > # CONFIG_BLK_DEV_HPT366 is not set
> > # CONFIG_BLK_DEV_SC1200 is not set
> > CONFIG_BLK_DEV_PIIX=y
> > # CONFIG_BLK_DEV_NS87415 is not set
> > # CONFIG_BLK_DEV_PDC202XX_OLD is not set
> > # CONFIG_BLK_DEV_PDC202XX_NEW is not set
> > # CONFIG_BLK_DEV_SVWKS is not set
> > # CONFIG_BLK_DEV_SIIMAGE is not set
> > # CONFIG_BLK_DEV_SIS5513 is not set
> > # CONFIG_BLK_DEV_SLC90E66 is not set
> > # CONFIG_BLK_DEV_TRM290 is not set
> > # CONFIG_BLK_DEV_VIA82CXXX is not set
> > # CONFIG_IDE_ARM is not set
> > # CONFIG_IDE_CHIPSETS is not set
> > CONFIG_BLK_DEV_IDEDMA=y
> > # CONFIG_IDEDMA_IVB is not set
> > CONFIG_IDEDMA_AUTO=y
> >
> > > Think this is not just a barrier problem (unless barrier is the default).
> > > One of my two drives gets the error below during operation.
> > > The drive is the root drive and is mounted with defaults. 2.6.6-mm4
> > > was the last kernel booted on this box. The 2.6.7-rc1-mm1 was compiled
> > > with 2.95 with the following fs options:
> > >
> > > CONFIG_EXT2_FS=y
> > > # CONFIG_EXT2_FS_XATTR is not set
> > > CONFIG_EXT3_FS=m
> > > # CONFIG_EXT3_FS_XATTR is not set
> > > CONFIG_JBD=m
> > > # CONFIG_JBD_DEBUG is not set
> > > CONFIG_REISERFS_FS=y
> > > # CONFIG_REISERFS_CHECK is not set
> > > # CONFIG_REISERFS_PROC_INFO is not set
> > > # CONFIG_REISERFS_FS_XATTR is not set
> > > # CONFIG_JFS_FS is not set
> > > # CONFIG_XFS_FS is not set
> > > # CONFIG_MINIX_FS is not set
> > > # CONFIG_ROMFS_FS is not set
> > > # CONFIG_QUOTA is not set
> > > # CONFIG_AUTOFS_FS is not set
> > > CONFIG_AUTOFS4_FS=m
> >
> > Disk /dev/hda: 6448 MB, 6448619520 bytes
> > 240 heads, 63 sectors/track, 833 cylinders
> > Units = cylinders of 15120 * 512 = 7741440 bytes
> >
> > Device Boot Start End Blocks Id System
> > /dev/hda1 1 99 748408+ 82 Linux swap
> > /dev/hda2 100 108 68040 83 Linux
> > /dev/hda3 * 109 833 5481000 83 Linux
> >
> > > hda reports:
> > > root@bert:/usr/src/linux# hdparm -iI /dev/hda
> > >
> > > /dev/hda:
> > >
> > > Model=WDC AC26400R, FwRev=15.01J15, SerialNo=WD-WM6271600165
> > > Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
> > > RawCHS=13328/15/63, TrkSize=57600, SectSize=600, ECCbytes=40
> > > BuffType=DualPortCache, BuffSize=512kB, MaxMultSect=16, MultSect=16
> > > CurCHS=13328/15/63, CurSects=12594960, LBA=yes, LBAsects=12594960
> > > IORDY=on/off, tPIO={min:160,w/IORDY:120}, tDMA={min:120,rec:120}
> > > PIO modes: pio0 pio1 pio2 pio3 pio4
> > > DMA modes: mdma0 mdma1 mdma2
> > > UDMA modes: udma0 udma1 *udma2 udma3 udma4
> > > AdvancedPM=no WriteCache=enabled
> > > Drive conforms to: device does not report version: 1 2 3 4
> > >
> > > * signifies the current active mode
> > >
> > >
> > > ATA device, with non-removable media
> > > Model Number: WDC AC26400R
> > > Serial Number: WD-WM6271600165
> > > Firmware Revision: 15.01J15
> > > Standards:
> > > Supported: 4 3 2 1
> > > Likely used: 4
> > > Configuration:
> > > Logical max current
> > > cylinders 13328 13328
> > > heads 15 15
> > > sectors/track 63 63
> > > --
> > > bytes/track: 57600 bytes/sector: 600
> > > CHS current addressable sectors: 12594960
> > > LBA user addressable sectors: 12594960
> > > device size with M = 1024*1024: 6149 MBytes
> > > device size with M = 1000*1000: 6448 MBytes (6 GB)
> > > Capabilities:
> > > LBA, IORDY(can be disabled)
> > > Buffer size: 512.0kB bytes avail on r/w long: 40 Queue depth: 1
> > > Standby timer values: spec'd by Standard, no device specific minimum
> > > R/W multiple sector transfer: Max = 16 Current = 16
> > > DMA: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4
> > > Cycle time: min=120ns recommended=120ns
> > > PIO: pio0 pio1 pio2 pio3 pio4
> > > Cycle time: no flow control=160ns IORDY flow control=120ns
> > > Commands/features:
> > > Enabled Supported:
> > > * READ BUFFER cmd
> > > * WRITE BUFFER cmd
> > > * Look-ahead
> > > * Write cache
> > > * Power Management feature set
> > > * SMART feature set
> > >
> > > root@bert:/usr/src/linux# hdparm -iI /dev/hdb
> > >
> > > /dev/hdb:
> > >
> > > Model=Maxtor 6E030L0, FwRev=NAR61590, SerialNo=E178CV5E
> > > Config={ Fixed }
> > > RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=57
> > > BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
> > > CurCHS=17475/15/63, CurSects=16513875, LBA=yes, LBAsects=60058656
> > > IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
> > > PIO modes: pio0 pio1 pio2 pio3 pio4
> > > DMA modes: mdma0 mdma1 mdma2
> > > UDMA modes: udma0 udma1 *udma2 udma3 udma4 udma5 udma6
> > > AdvancedPM=yes: disabled (255) WriteCache=enabled
> > > Drive conforms to: (null):
> > >
> > > * signifies the current active mode
> > >
> > >
> > > ATA device, with non-removable media
> > > Model Number: Maxtor 6E030L0
> > > Serial Number: E178CV5E
> > > Firmware Revision: NAR61590
> > > Standards:
> > > Supported: 7 6 5 4
> > > Likely used: 7
> > > Configuration:
> > > Logical max current
> > > cylinders 16383 17475
> > > heads 16 15
> > > sectors/track 63 63
> > > --
> > > CHS current addressable sectors: 16513875
> > > LBA user addressable sectors: 60058656
> > > device size with M = 1024*1024: 29325 MBytes
> > > device size with M = 1000*1000: 30750 MBytes (30 GB)
> > > Capabilities:
> > > LBA, IORDY(can be disabled)
> > > Queue depth: 1
> > > Standby timer values: spec'd by Standard, no device specific minimum
> > > R/W multiple sector transfer: Max = 16 Current = 16
> > > Advanced power management level: unknown setting (0x0000)
> > > Recommended acoustic management value: 192, current value: 254
> > > DMA: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 udma5 udma6
> > > Cycle time: min=120ns recommended=120ns
> > > PIO: pio0 pio1 pio2 pio3 pio4
> > > Cycle time: no flow control=120ns IORDY flow control=120ns
> > > Commands/features:
> > > Enabled Supported:
> > > * NOP cmd
> > > * READ BUFFER cmd
> > > * WRITE BUFFER cmd
> > > * Host Protected Area feature set
> > > * Look-ahead
> > > * Write cache
> > > * Power Management feature set
> > > Security Mode feature set
> > > * SMART feature set
> > > * FLUSH CACHE EXT command
> > > * Mandatory FLUSH CACHE command
> > > * Device Configuration Overlay feature set
> > > * Automatic Acoustic Management feature set
> > > SET MAX security extension
> > > Advanced Power Management feature set
> > > * DOWNLOAD MICROCODE cmd
> > > * SMART self-test
> > > * SMART error logging
> > > Security:
> > > Master password revision code = 65534
> > > supported
> > > not enabled
> > > not locked
> > > not frozen
> > > not expired: security count
> > > not supported: enhanced erase
> > > HW reset results:
> > > CBLID- above Vih
> > > Device num = 1 determined by CSEL
> > > Checksum: correct
> > >
> > > hdb is accessed via dm and evms. This is what the boot reports:
> > >
> > > May 27 18:17:39 bert kernel: Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> > > May 27 18:17:39 bert kernel: ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> > > May 27 18:17:39 bert kernel: PIIX4: IDE controller at PCI slot 0000:00:14.1
> > > May 27 18:17:39 bert kernel: PIIX4: chipset revision 1
> > > May 27 18:17:39 bert kernel: PIIX4: not 100%% native mode: will probe irqs later
> > > May 27 18:17:39 bert kernel: ide0: BM-DMA at 0x10c0-0x10c7, BIOS settings: hda:pio, hdb:DMA
> > > May 27 18:17:39 bert kernel: ide1: BM-DMA at 0x10c8-0x10cf, BIOS settings: hdc:DMA, hdd:pio
> > > May 27 18:17:39 bert kernel: hda: WDC AC26400R, ATA DISK drive
> > > May 27 18:17:39 bert kernel: hdb: Maxtor 6E030L0, ATA DISK drive
> > > May 27 18:17:39 bert kernel: Using anticipatory io scheduler
> > > May 27 18:17:39 bert kernel: ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> > > May 27 18:17:39 bert kernel: hdc: HL-DT-ST RW/DVD GCC-4480B, ATAPI CD/DVD-ROM drive
> > > May 27 18:17:39 bert kernel: ide1 at 0x170-0x177,0x376 on irq 15
> > > May 27 18:17:39 bert kernel: pnp: the driver 'ide' has been registered
> > > May 27 18:17:39 bert kernel: hda: max request size: 128KiB
> > > May 27 18:17:39 bert kernel: hda: 12594960 sectors (6448 MB) w/512KiB Cache, CHS=13328/15/63, UDMA(33)
> > > May 27 18:17:39 bert kernel: hda: cache flushes supported
> > > May 27 18:17:39 bert kernel: hda: hda1 hda2 hda3
> > > May 27 18:17:39 bert kernel: hdb: max request size: 128KiB
> > > May 27 18:17:39 bert kernel: hdb: 60058656 sectors (30750 MB) w/2048KiB Cache, CHS=59582/16/63, UDMA(33)
> > > May 27 18:17:39 bert kernel: hdb: cache flushes supported
> > > May 27 18:17:39 bert kernel: hdb: hdb1 hdb2 hdb3 hdb4 < hdb5 >
> > >
> > > followed later by:
> > >
> > > May 27 18:18:05 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > May 27 18:18:05 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > > May 27 18:18:06 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > May 27 18:18:06 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > > May 27 18:19:21 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > May 27 18:19:21 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > > May 27 18:19:22 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > May 27 18:19:22 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > > May 27 18:20:01 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > May 27 18:20:01 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > > May 27 18:20:01 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > May 27 18:20:01 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > > May 27 18:21:27 bert kernel: hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > May 27 18:21:27 bert kernel: hda: drive_cmd: error=0x04 { DriveStatusError }
> > >
> > >
> > >
> > >
> > > Hope this helps,
> > >
> > > Ed
> > >
> > > On May 27, 2004 04:24 pm, Günther Persoons wrote:
> > > > Hey,
> > > > When I mount my reiser partition with the option barrier=flush I get
> > > > the following message and error:
> > > > My harddrive is a 2.5 inch Fujitsu 20GB IDE.
> > > >
> > > > mount /dev/hda10 /tmp -o barrier=flush
> > > > mount: wrong fs type, bad option, bad superblock on /dev/hda10,
> > > > or too many mounted file systems
> > > > Log:
> > > > ReiserFS: hda10: found reiserfs format "3.6" with standard journal
> > > > ReiserFS: hda10: using ordered data mode
> > > > reiserfs: using flush barriers
> > > > ReiserFS: hda10: journal params: device hda10, size 8192, journal first
> > > > block 18, max trans len 1024, max batch 900, max commit age 30, max
> > > > trans age 30
> > > > ReiserFS: hda10: checking transaction log (hda10)
> > > > hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
> > > > hda: drive_cmd: error=0x04 { DriveStatusError }
> > > > hda: barrier support doesn't work
> > > > ReiserFS: hda10: warning: journal-837: IO error during journal replay
> > > > ReiserFS: hda10: warning: Replay Failure, unable to mount
> > > > ReiserFS: hda10: warning: sh-2022: reiserfs_fill_super: unable to
> > > > initialize journal space
> > >
> >
> >
>

2004-06-26 08:42:08

by Andre Hedrick

Subject: Re: ide errors in 7-rc1-mm1 and later


Eric,

There is no need for a new opcode.
The behavior is simple and trivial to support.

If the standard flush_cache/ext commands took the same taskfile register
setup as a standard data_in command, yet used the non_data command state
machine, it would be done.

The special case would be dealing with LBA zero, which would have to behave
like a complete device flush. Since flushing sector zero is not generally
done ... well, this would go into a design debate, and it is neither my issue
nor my desire to enter one today.
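
As an illustration only, the semantics described above can be modelled on a toy write cache. Everything here is hypothetical: `flush`, its parameters, and the dict standing in for the drive's dirty sectors are assumptions made for the sketch, not an existing ATA command layout or driver interface.

```python
def flush(dirty, lba=0, count=0):
    """Toy model of a ranged flush.

    dirty maps LBA -> cached data that has not reached the media yet.
    LBA zero is treated as the legacy case and flushes the whole
    device; any other LBA flushes only [lba, lba + count).
    Returns the list of LBAs that were written out.
    """
    if lba == 0:
        targets = sorted(dirty)
    else:
        targets = sorted(s for s in dirty if lba <= s < lba + count)
    for sector in targets:
        del dirty[sector]  # sector is now considered on the media
    return targets


dirty = {5: "a", 10: "b", 300: "c"}
print(flush(dirty, lba=8, count=4))  # only the sector inside the range
print(flush(dirty))                  # legacy call: everything that is left
```

The sketch also shows why LBA zero is awkward: a caller that genuinely wants to flush a range starting at sector zero cannot express it, which is the design debate left open above.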

28-bit would support max 256 sectors
48-bit would support max 65536 sectors
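
The two limits follow from the width of the sector count field: 8 bits for 28-bit commands and 16 bits for 48-bit commands. A minimal sketch of the arithmetic, assuming 512-byte sectors as in the quoted message below; `max_flush_span` is a hypothetical helper name.

```python
SECTOR_SIZE = 512  # bytes per LBA (assumption taken from the thread)


def max_flush_span(count_bits, sector_size=SECTOR_SIZE):
    """Return (max sectors, max bytes) for a sector count field of
    count_bits bits, where an all-zero count means the maximum."""
    max_sectors = 1 << count_bits
    return max_sectors, max_sectors * sector_size


for label, bits in (("28-bit", 8), ("48-bit", 16)):
    sectors, nbytes = max_flush_span(bits)
    print(f"{label}: {sectors} sectors, {nbytes // 1024} KiB per command")
```

With these assumptions a 28-bit command covers 128 KiB and a 48-bit command covers 32768 KiB, which is the 32MiB region mentioned in the quoted message.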

Anyone could write this simple proposal to T13 for SATA and T10 for SAS.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

On Thu, 10 Jun 2004, Eric D. Mudama wrote:

> On Thu, Jun 10 at 8:11, Jens Axboe wrote:
> >On Thu, Jun 10 2004, Bartlomiej Zolnierkiewicz wrote:
> >>
> >> /me just thinks loudly
> >>
> >> 'linear range' FLUSH CACHE seems so easy to implement that I always wondered
> >> why FLUSH CACHE command doesn't make any use of LBA address and number
> >> of sectors.
> >
> >Indeed, that would be very helpful as well.
>
> Neat idea... so you send us a LBA and a block count, and we return
> good status if that region is flushed.
>
> Each command can specify a 32MiB region, assuming a device with
> 512-byte LBAs.
>
> Propose an exact implementation and an opcode...
>
> --
> Eric D. Mudama
> [email protected]
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2004-06-26 09:09:22

by Andre Hedrick

Subject: Re: ide errors in 7-rc1-mm1 and later


One more thing ...

Barriers are useless unless always used.
Given that drives will autoflush their cache, should an error occur and a
manual flush cache then be issued, the drive goes offline. The state machine
is in an error state and cannot be recovered. So there actually needs to be
an error state test opcode or something similar first.

However, since barriers generally are not wrapped around all writes
associated with the entire transaction (i.e. the transaction is larger than
the total sectors transferred), multiple writes, flush(1), down block,
flush(2) for journalling ...

Failure on flush(1) could play havoc if there were cached writes from
previous non-journalled transactions to the device. This kind of recovery
would require block_request_aging similar to VM paging (clean/dirty lists)
and has the potential to become ugly.
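
The failure case can be illustrated with a toy model. This is a sketch under stated assumptions, not kernel code: `ToyDisk` and `commit` are hypothetical names, the cache is a plain dict, and reading "down block" as the journal's commit record is an assumption.

```python
class ToyDisk:
    """Disk with a volatile write cache in front of persistent media."""

    def __init__(self):
        self.cache = {}  # lba -> data, lost or uncertain on failure
        self.media = {}  # lba -> data, safely written

    def write(self, lba, data):
        self.cache[lba] = data

    def flush(self, fail=False):
        if fail:
            raise IOError("flush failed")
        self.media.update(self.cache)
        self.cache.clear()


def commit(disk, journal_blocks, commit_lba, fail_first_flush=False):
    """Journal commit: writes, flush(1), commit record, flush(2)."""
    for lba, data in journal_blocks:
        disk.write(lba, data)
    disk.flush(fail=fail_first_flush)  # flush(1)
    disk.write(commit_lba, "commit")
    disk.flush()                       # flush(2)


disk = ToyDisk()
disk.write(500, "earlier non-journalled write")
try:
    commit(disk, [(100, "j0"), (101, "j1")], 102, fail_first_flush=True)
except IOError:
    # The failed flush covered the earlier write as well as the journal
    # blocks, so the error cannot be attributed to the transaction alone.
    print(sorted(disk.cache))
```

After the failed flush(1) the cache still holds sector 500 next to the journal blocks. Telling which of those writes were affected is what would need per-request tracking of clean and dirty state.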

I have the full model to deploy, but I have no time or desire to champion
the project.

Regards,

Andre Hedrick
LAD Storage Consulting Group


On Sat, 26 Jun 2004, Andre Hedrick wrote:

>
> Eric,
>
> There is no need for a new opcode.
> The behavior is simple and trivial to support.
>
> If the standard flush_cache/ext commands took the same taskfile register
> setup as a standard data_in command, yet used the non_data command state
> machine, it would be done.
>
> The special case would be dealing with LBA zero, which would have to behave
> like a complete device flush. Since flushing sector zero is not generally
> done ... well, this would go into a design debate, and it is neither my issue
> nor my desire to enter one today.
>
> 28-bit would support max 256 sectors
> 48-bit would support max 65536 sectors
>
> Anyone could write this simple proposal to T13 for SATA and T10 for SAS.
>
> Cheers,
>
> Andre Hedrick
> LAD Storage Consulting Group
>
> On Thu, 10 Jun 2004, Eric D. Mudama wrote:
>
> > On Thu, Jun 10 at 8:11, Jens Axboe wrote:
> > >On Thu, Jun 10 2004, Bartlomiej Zolnierkiewicz wrote:
> > >>
> > >> /me just thinks loudly
> > >>
> > >> 'linear range' FLUSH CACHE seems so easy to implement that I always wondered
> > >> why FLUSH CACHE command doesn't make any use of LBA address and number
> > >> of sectors.
> > >
> > >Indeed, that would be very helpful as well.
> >
> > Neat idea... so you send us a LBA and a block count, and we return
> > good status if that region is flushed.
> >
> > Each command can specify a 32MiB region, assuming a device with
> > 512-byte LBAs.
> >
> > Propose an exact implementation and an opcode...
> >
> > --
> > Eric D. Mudama
> > [email protected]
> >

2004-06-28 18:14:56

by Eric D. Mudama

Subject: Re: ide errors in 7-rc1-mm1 and later

On Sat, Jun 26 at 1:31, Andre Hedrick wrote:
>
>Eric,
>
>There is no need for a new opcode.
>The behavior is simple and trivial to support.
>
>If the standard flush_cache/ext commands took the same taskfile register
>setup as a standard data_in command, yet used the non_data command state
>machine, it would be done.
>
>The special case would be dealing with LBA zero, which would have to behave
>like a complete device flush. Since flushing sector zero is not generally
>done ... well, this would go into a design debate, and it is neither my issue
>nor my desire to enter one today.
>
>28-bit would support max 256 sectors
>48-bit would support max 65536 sectors
>
>Anyone could write this simple proposal to T13 for SATA and T10 for SAS.

True, that would work just as well.

But as you mention, it isn't necessarily what people want, or think
they want, or could actually use...

--
Eric D. Mudama
[email protected]